[Labs-l] Scripts which adds template to articles created by ContentTranslation tool do not work on the grid

Bryan Davis bd808 at wikimedia.org
Fri Jun 16 21:58:01 UTC 2017


On Fri, Jun 16, 2017 at 10:15 AM, Martin Urbanec
<martin.urbanec at wikimedia.cz> wrote:
> Hello,
>
> I have a script which should add a template to articles which are created by
> the ContentTranslation tool (the template has parameters which depend on
> the language and revision which were used as the source; this is the reason
> why I use a separate script). It may be found at
> https://github.com/urbanecm/addPrekladCT/blob/master/addmissing.py. The
> script works perfectly on my local PC and on the bastion host but I can't get
> it to work on the grid.
>
> The script itself is run by python3 addmissing.py -always -file:pages.txt
> -search:'-insource:/\{\{[Pp]řeklad/' and requires a pages.txt file and the
> preklads.txt file at
> https://tools.wmflabs.org/urbanecmbot/test/preklads.txt. The first contains
> the pages that should be processed and acts as the generator; the second is
> something like a database with the exact templates which should be inserted.
> Both files are attached as examples.
>
> When I try to run it on the toollabs bastion, everything works as it should.
> When I send the script to the grid, it does not work (see sample output
> below). Why? Can somebody help me with it?
>
> Thank you in advance,
> Martin Urbanec / Urbanecm
>
> ; Output
>
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ cat test.sh
> python3 addmissing.py -always -file:pages.txt -search:'-insource:/\{\{[Pp]řeklad/'
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ jsub bash test.sh
> Your job 6201363 ("bash") has been submitted
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ qstat
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
> 6201363 0.30000 bash       urbanecm     r     06/16/2017 18:14:42 task@tools-exec-1404.eqiad.wmf     1
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ ls ~/bash.*
> /home/urbanecm/bash.err  /home/urbanecm/bash.out
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ cat ~/bash.*
> Traceback (most recent call last):
>   File "addmissing.py", line 223, in <module>
>     main()
>   File "addmissing.py", line 183, in main
>     local_args = pywikibot.handle_args(args)
>   File "/shared/pywikipedia/core/pywikibot/bot.py", line 954, in handle_args
>     writeToCommandLogFile()
>   File "/shared/pywikipedia/core/pywikibot/bot.py", line 1128, in writeToCommandLogFile
>     command_log_file.write(s + os.linesep)
>   File "/usr/lib/python3.4/codecs.py", line 711, in write
>     return self.writer.write(data)
>   File "/usr/lib/python3.4/codecs.py", line 368, in write
>     data, consumed = self.encode(object, self.errors)
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 67: surrogates not allowed
> CRITICAL: Closing network session.
> <class 'UnicodeEncodeError'>
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $

Zhuyifei1999 saw your email and noted on IRC that it looks like a
case of the known bug that I just retitled as "Shell LOCALE neither
consistent nor sane across grid engine nodes"
(<https://phabricator.wikimedia.org/T60784>). The current best
workaround for that bug is to launch the job as a shell script that
sets either LANG=C.UTF-8 or PYTHONIOENCODING=utf-8.
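
To confirm that the locale really differs between the bastion and the
grid, a small diagnostic along these lines could be run in both places
(interactively and via jsub) and the two outputs compared. This is only
a sketch; the function name describe_encodings is made up for
illustration and is not part of pywikibot.

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the settings that decide how Python 3 encodes text I/O."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        # Encoding Python derives from the locale for text files.
        "preferred encoding": locale.getpreferredencoding(False),
        # Encoding used to decode command line arguments and file names.
        "filesystem encoding": sys.getfilesystemencoding(),
        # Encoding used by print(); may be None if stdout is replaced.
        "stdout encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print("{}: {}".format(name, value))
```

If the grid run reports an ASCII-based encoding where the bastion
reports UTF-8, the locale bug is the likely cause.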

If running the job with the same locale you use in your interactive
tests does not fix the problem, you may also be hitting a deeper
Python 3 Unicode issue related to surrogate code points
(<https://bugs.python.org/issue12892>). The "position 67: surrogates
not allowed" part of the error message hints at this.

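A plausible origin for the '\udcc5' in the traceback is the 'ř' in the
-search argument: Python 3 decodes command line arguments with the
'surrogateescape' handler, so under an ASCII locale each non-ASCII byte
becomes a lone surrogate. The sketch below reproduces that mapping by
hand; it is an illustration of the mechanism under that assumption, not
a trace of what pywikibot actually does.

```python
# 'ř' is U+0159, which is the two bytes C5 99 in UTF-8.
raw = "\u0159".encode("utf-8")

# Decoding those bytes as ASCII with surrogateescape maps each
# undecodable byte 0xNN to the lone surrogate U+DCNN.
mangled = raw.decode("ascii", "surrogateescape")

# A strict UTF-8 encode of the result fails, as in the traceback.
error = None
try:
    mangled.encode("utf-8")
except UnicodeEncodeError as exc:
    error = str(exc)

# Reversing the escape and decoding as UTF-8 recovers the original text.
recovered = mangled.encode("ascii", "surrogateescape").decode("utf-8")

print(repr(raw), ascii(mangled), error, recovered == "\u0159")
```

The first surrogate produced this way is '\udcc5', matching the
character named in the error message.
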
I can actually reproduce your error message in an interactive Python
session on tools-dev, starting from LANG=en_US.UTF-8:

  $ python3
  Python 3.4.0 (default, Jun 19 2015, 14:20:21)
  [GCC 4.8.2] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> print('\udcc5')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 0: surrogates not allowed
  >>>

Explicitly encoding with the 'surrogateescape' error handler does work:
  >>> print('\udcc5'.encode('utf-8', 'surrogateescape'))
  b'\xc5'

It looks like the error could be dealt with in pywikibot by patching
writeToCommandLogFile() to open the codec used for output with an
error handler other than the default errors='strict'
(<https://docs.python.org/3/library/codecs.html#error-handlers>).

  $ python3
  Python 3.4.0 (default, Jun 19 2015, 14:20:21)
  [GCC 4.8.2] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> print('\udcc5'.encode('utf-8', 'ignore'))
  b''
  >>> print('\udcc5'.encode('utf-8', 'replace'))
  b'?'
  >>> print('\udcc5'.encode('utf-8', 'xmlcharrefreplace'))
  b'&#56517;'
  >>> print('\udcc5'.encode('utf-8', 'backslashreplace'))
  b'\\udcc5'
  >>> print('\udcc5'.encode('utf-8', 'surrogateescape'))
  b'\xc5'
  >>> print('\udcc5'.encode('utf-8', 'surrogatepass'))
  b'\xed\xb3\x85'
  >>>
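
As a rough idea of what such a patch might look like, here is a sketch
of a log writer that opens its file with errors='backslashreplace', so
lone surrogates are written as escapes instead of raising. The helper
name append_log_line and the sample log line are hypothetical; this is
not the actual pywikibot code.

```python
import os
import tempfile


def append_log_line(path, line):
    """Append one line to a UTF-8 log file without failing on lone surrogates."""
    # backslashreplace writes unencodable characters as \uXXXX escapes.
    with open(path, "a", encoding="utf-8", errors="backslashreplace") as log:
        log.write(line + "\n")


# Demonstration on a temporary file with a surrogate-laden argument.
fd, demo_path = tempfile.mkstemp(suffix=".log")
os.close(fd)
append_log_line(demo_path, "-search:'P\udcc5\udc99eklad'")
with open(demo_path, "rb") as handle:
    written = handle.read()
os.remove(demo_path)
print(written)
```

'backslashreplace' keeps the log readable and loses no information;
'surrogateescape' would instead write back the original bytes.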


Bryan
-- 
Bryan Davis              Wikimedia Foundation    <bd808 at wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services          Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855


