XZise added a comment.
Ah I think no need for that, because I think I know what is happening:
>>> from __future__ import unicode_literals
>>> import re
>>> re.sub('(?is)A', '', 'Ö'.encode('latin1'))
'\xd6'
>>> re.sub('(?is)A', '', 'ÖA'.encode('latin1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xzise/.pyenv/versions/2.7.8/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 0: ordinal not in range(128)
The error only appears when the substitution actually replaces something. In my previous examples nothing was replaced, so it worked. But when there is a match, the replacement (which is unicode because of unicode_literals) has to be combined with the surrounding bytes, so Python 2 implicitly tries to decode those bytes as ASCII, and that fails on 0xd6. You can verify this by editing the line where the error happens (from your previous errors that is "core/scripts/reflinks.py", line 647). Currently it looks like this:
linkedpagetext = self.NON_HTML.sub('', linkedpagetext)
But it should work when it looks like this:
linkedpagetext = self.NON_HTML.sub(str(''), linkedpagetext)
I need to figure out if `linkedpagetext` is also `bytes` in Python 3 but that fix will work at least in Python 2.
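One way to sidestep the Python 2/3 question might be to pick the empty replacement based on the type of the text itself. The following is only a sketch: `strip_matches` is a hypothetical helper that does not exist in reflinks.py, and the pattern below is a made-up stand-in for `NON_HTML`. It also assumes the compiled pattern has the same type as the text, which Python 3 requires anyway.

```python
from __future__ import unicode_literals

import re

# Made-up stand-in for the NON_HTML pattern; the real one in reflinks.py differs.
NON_HTML = re.compile(b'(?is)<script[^>]*>.*?</script>')


def strip_matches(pattern, text):
    """Remove every match of pattern from text (hypothetical helper).

    The empty replacement is chosen to have the same type as text, so
    bytes and unicode are never mixed when a match is replaced.
    """
    empty = b'' if isinstance(text, bytes) else ''
    return pattern.sub(empty, text)


# Bytes in, bytes out: the non-ASCII byte 0xd6 is left untouched.
cleaned = strip_matches(NON_HTML, b'\xd6<script>x</script>A')
```

With this approach the call site would not need to know whether `linkedpagetext` is bytes or unicode, as long as the pattern was compiled from the matching type.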
TASK DETAIL https://phabricator.wikimedia.org/T94688
To: XZise Cc: Ricordisamoa, jayvdb, XZise, Aklapper, Rubin16, pywikipedia-bugs