[Labs-l] Scripts which adds template to articles created by ContentTranslation tool do not work on the grid

Merlijn van Deen (valhallasw) valhallasw at arctus.nl
Sat Jun 17 09:14:20 UTC 2017


Hi all,

This is a combination of a Python 3 design choice (PEP 383 [1]) and T60784
[2]. What happens is the following:

1) The locale is set to an encoding that cannot decode certain bytes -- for
example, ASCII, which can only decode bytes < 128.
2) Python is started with a command line parameter that contains a byte >=
128 (\x80). For example, "ř", when UTF-8 encoded, is represented by two
bytes: \xc5\x99. Both of these are > \x80, and can therefore not be
interpreted as ASCII.
3) Python3 needs to somehow decode these bytes into a text string. But
there is no valid way to do so! Instead of complaining loudly with a
UnicodeDecodeError, Python3 embeds the bytes as 'fake characters' (lone
surrogate code points) in the string -- as described in PEP 383.
\xc5\x99 is therefore now suddenly decoded as "\udcc5\udc99" instead of
"ř".
4) Pywikibot tries to encode these characters using utf-8, but they are
fake characters, and the .encode step blows up.
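
The steps above can also be walked through without changing the locale. The
following is only a minimal sketch: it assumes that, under an ASCII locale,
Python 3 decodes command line bytes with the 'ascii' codec and the
'surrogateescape' error handler, as PEP 383 describes. The variable names are
illustrative and do not come from Pywikibot.

```python
# Step 2: the bytes the shell hands over for "ř" on a UTF-8 terminal.
raw = "ř".encode("utf-8")  # two bytes: 0xc5 0x99

# Step 3: emulate the PEP 383 decoding under an ASCII locale (assumption).
decoded = raw.decode("ascii", "surrogateescape")

# Step 4: a strict UTF-8 encode of the lone surrogates is rejected.
try:
    decoded.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The original bytes are not lost: the same error handler restores them.
recovered = decoded.encode("ascii", "surrogateescape").decode("utf-8")

# ascii() keeps the output printable even when stdout itself is ASCII.
print(ascii(decoded), strict_ok, recovered == "ř")
```

The last step is the point of PEP 383: the surrogates are a reversible
placeholder for the undecodable bytes, not data loss.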

A simple way to reproduce this is the following:

valhallasw at tools-bastion-03:~/ucm$ cat test.py
import sys
encoded = sys.argv[1].encode('utf-8')

valhallasw at tools-bastion-03:~/ucm$ LC_ALL=C python3 test.py řeklad
Traceback (most recent call last):
  File "test.py", line 2, in <module>
    encoded = sys.argv[1].encode('utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 0: surrogates not allowed

This should be fixed in future Python versions (likely 3.7), when PEP 540
[3] is implemented.

As for the current situation, the simplest solution is to add 'export
LC_ALL=en_US.UTF-8' to your script, before the 'python ...' line.
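
If changing the job's environment is not an option, the arguments could in
principle be repaired inside the script instead. The sketch below is an
assumption-laden illustration, not something Pywikibot does: recover_args is
a hypothetical helper, and it assumes the shell really passed UTF-8 bytes.

```python
import sys


def recover_args(args, encoding=None):
    """Re-decode surrogate-escaped arguments as UTF-8 (hypothetical helper)."""
    # Default to the codec Python used for argv; 'ascii' under LC_ALL=C.
    encoding = encoding or sys.getfilesystemencoding()
    fixed = []
    for arg in args:
        try:
            # Turn the surrogates back into the original bytes, then decode
            # those bytes with the encoding the shell actually used.
            raw = arg.encode(encoding, "surrogateescape")
            fixed.append(raw.decode("utf-8"))
        except UnicodeError:
            # Not recoverable as UTF-8: leave the argument untouched.
            fixed.append(arg)
    return fixed


if __name__ == "__main__":
    args = recover_args(sys.argv[1:])
```

Setting the locale in the job script remains the simpler fix, because it also
covers output streams and file names rather than only sys.argv.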

Best,
Merlijn

[1] https://www.python.org/dev/peps/pep-0383/
[2] https://phabricator.wikimedia.org/T60784
[3] https://www.python.org/dev/peps/pep-0540/


On 16 June 2017 at 23:58, Bryan Davis <bd808 at wikimedia.org> wrote:

> On Fri, Jun 16, 2017 at 10:15 AM, Martin Urbanec
> <martin.urbanec at wikimedia.cz> wrote:
> > Hello,
> >
> > I have a script which should add a template to articles which are
> > created by the ContentTranslation tool (the template has parameters
> > which depend on the language and revision that were used as the source;
> > this is the reason why I use a separate script). It may be found at
> > https://github.com/urbanecm/addPrekladCT/blob/master/addmissing.py. The
> > script works perfectly on my local PC and on the bastion host, but I
> > can't get it to work on the grid.
> >
> > The script itself is run by python3 addmissing.py -always -file:pages.txt
> > -search:'-insource:/\{\{[Pp]řeklad/' and requires a pages.txt file and
> > the preklads.txt file at
> > https://tools.wmflabs.org/urbanecmbot/test/preklads.txt. The first
> > contains the pages that should be processed and acts as the generator;
> > the second is something like a database with the exact templates which
> > should be inserted. Both files are attached as examples.
> >
> > When I try to run it on the toollabs bastion, everything works as it
> > should. When I send the script to the grid, it does not work (see sample
> > output below). Why? Can somebody help me with it?
> >
> > Thank you in advance,
> > Martin Urbanec / Urbanecm
> >
> > ; Output
> >
> > urbanecm at tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> > $ cat test.sh
> > python3 addmissing.py -always -file:pages.txt -search:'-insource:/\{\{[Pp]řeklad/'
> > urbanecm at tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> > $ jsub bash test.sh
> > Your job 6201363 ("bash") has been submitted
> > urbanecm at tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> > $ qstat
> > job-ID  prior   name  user      state submit/start at     queue                             slots ja-task-ID
> > ------------------------------------------------------------------------------------------------------------
> > 6201363 0.30000 bash  urbanecm  r     06/16/2017 18:14:42 task at tools-exec-1404.eqiad.wmf 1
> > urbanecm at tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> > $ ls ~/bash.*
> > /home/urbanecm/bash.err  /home/urbanecm/bash.out
> > urbanecm at tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> > $ cat ~/bash.*
> > Traceback (most recent call last):
> >   File "addmissing.py", line 223, in <module>
> >     main()
> >   File "addmissing.py", line 183, in main
> >     local_args = pywikibot.handle_args(args)
> >   File "/shared/pywikipedia/core/pywikibot/bot.py", line 954, in handle_args
> >     writeToCommandLogFile()
> >   File "/shared/pywikipedia/core/pywikibot/bot.py", line 1128, in writeToCommandLogFile
> >     command_log_file.write(s + os.linesep)
> >   File "/usr/lib/python3.4/codecs.py", line 711, in write
> >     return self.writer.write(data)
> >   File "/usr/lib/python3.4/codecs.py", line 368, in write
> >     data, consumed = self.encode(object, self.errors)
> > UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 67: surrogates not allowed
> > CRITICAL: Closing network session.
> > <class 'UnicodeEncodeError'>
> > urbanecm at tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> > $
>
> Zhuyifei1999 saw your email and noted on irc that it looks to be a
> case of the known bug that I just retitled as "Shell LOCALE neither
> consistent nor sane across grid engine nodes"
> (<https://phabricator.wikimedia.org/T60784>). The current best
> workaround for that bug is to launch the job as a shell script that sets
> either LANG=C.UTF-8 or PYTHONIOENCODING=utf-8.
>
> If setting the job to run with the same locale you are using in your
> interactive tests does not work to fix the problem, you may also be
> hitting a deeper Python3 unicode issue related to surrogate codepoints
> (<https://bugs.python.org/issue12892>). This is hinted by the
> "position 67: surrogates not allowed" error message.
>
> I can actually reproduce your error message in an interactive python
> session on tools-dev from a starting state of LANG=en_US.UTF-8:
>
>   $ python3
>   Python 3.4.0 (default, Jun 19 2015, 14:20:21)
>   [GCC 4.8.2] on linux
>   Type "help", "copyright", "credits" or "license" for more information.
>   >>> print('\udcc5')
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in <module>
>   UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 0: surrogates not allowed
>   >>>
>
> Explicitly encoding using 'surrogateescape' does work:
>   >>> print('\udcc5'.encode('utf-8', 'surrogateescape'))
>   b'\xc5'
>
> It looks like the error could be dealt with in pywikibot by patching
> writeToCommandLogFile() to open the codec used for output with any
> value other than the default errors='strict'
> (<https://docs.python.org/3/library/codecs.html#error-handlers>).
>
>   $ python3
>   Python 3.4.0 (default, Jun 19 2015, 14:20:21)
>   [GCC 4.8.2] on linux
>   Type "help", "copyright", "credits" or "license" for more information.
>   >>> print('\udcc5'.encode('utf-8', 'ignore'))
>   b''
>   >>> print('\udcc5'.encode('utf-8', 'replace'))
>   b'?'
>   >>> print('\udcc5'.encode('utf-8', 'xmlcharrefreplace'))
>   b'&#56517;'
>   >>> print('\udcc5'.encode('utf-8', 'backslashreplace'))
>   b'\\udcc5'
>   >>> print('\udcc5'.encode('utf-8', 'surrogateescape'))
>   b'\xc5'
>   >>> print('\udcc5'.encode('utf-8', 'surrogatepass'))
>   b'\xed\xb3\x85'
>   >>>
>
>
> Bryan
> --
> Bryan Davis              Wikimedia Foundation    <bd808 at wikimedia.org>
> [[m:User:BDavis_(WMF)]] Manager, Cloud Services          Boise, ID USA
> irc: bd808                                        v:415.839.6885 x6855