[Labs-l] Scripts which adds template to articles created by ContentTranslation tool do not work on the grid
Bryan Davis
bd808 at wikimedia.org
Fri Jun 16 21:58:01 UTC 2017
On Fri, Jun 16, 2017 at 10:15 AM, Martin Urbanec
<martin.urbanec at wikimedia.cz> wrote:
> Hello,
>
> I have a script which should add a template to articles created by the
> ContentTranslation tool (the template has parameters which depend on the
> language and revision used as the source; this is why I use a separate
> script). It can be found at
> https://github.com/urbanecm/addPrekladCT/blob/master/addmissing.py. The
> script works perfectly on my local PC and on the bastion host, but I can't
> get it to work on the grid.
>
> The script itself is run as
> python3 addmissing.py -always -file:pages.txt -search:'-insource:/\{\{[Pp]řeklad/'
> and requires a pages.txt file and a preklads.txt file at
> https://tools.wmflabs.org/urbanecmbot/test/preklads.txt. The first contains
> the pages that should be processed and acts as the generator; the second is
> something like a database of the exact templates which should be inserted.
> Examples of both files are in the attachments.
>
> When I run it on the toollabs bastion, everything works as it should. When I
> send the script to the grid, it does not work (see sample output below). Why?
> Can somebody help me with it?
>
> Thank you in advance,
> Martin Urbanec / Urbanecm
>
> ; Output
>
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ cat test.sh
> python3 addmissing.py -always -file:pages.txt -search:'-insource:/\{\{[Pp]řeklad/'
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ jsub bash test.sh
> Your job 6201363 ("bash") has been submitted
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ qstat
> job-ID  prior   name user     state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
> 6201363 0.30000 bash urbanecm r     06/16/2017 18:14:42 task@tools-exec-1404.eqiad.wmf 1
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ ls ~/bash.*
> /home/urbanecm/bash.err /home/urbanecm/bash.out
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ cat ~/bash.*
> Traceback (most recent call last):
>   File "addmissing.py", line 223, in <module>
>     main()
>   File "addmissing.py", line 183, in main
>     local_args = pywikibot.handle_args(args)
>   File "/shared/pywikipedia/core/pywikibot/bot.py", line 954, in handle_args
>     writeToCommandLogFile()
>   File "/shared/pywikipedia/core/pywikibot/bot.py", line 1128, in writeToCommandLogFile
>     command_log_file.write(s + os.linesep)
>   File "/usr/lib/python3.4/codecs.py", line 711, in write
>     return self.writer.write(data)
>   File "/usr/lib/python3.4/codecs.py", line 368, in write
>     data, consumed = self.encode(object, self.errors)
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 67: surrogates not allowed
> CRITICAL: Closing network session.
> <class 'UnicodeEncodeError'>
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $
Zhuyifei1999 saw your email and noted on IRC that it looks like a
case of the known bug that I just retitled as "Shell LOCALE neither
consistent nor sane across grid engine nodes"
(<https://phabricator.wikimedia.org/T60784>). The current best
workaround for that bug is to launch the job as a shell script that sets
either LANG=C.UTF-8 or PYTHONIOENCODING=utf-8.
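One way to check whether the grid job and the interactive shell really see
the same locale is to print the relevant settings from both. The following is
only a minimal sketch; report_encodings is a name made up here for
illustration and is not part of pywikibot or of the original script.

```python
import locale
import os
import sys


def report_encodings():
    """Collect the settings that influence how Python 3 decodes
    command-line arguments and encodes its output."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        # Encoding derived from the locale, without changing the locale.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names and command-line arguments.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding of standard output (None if stdout has no encoding).
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in sorted(report_encodings().items()):
        print("{}: {}".format(name, value))
```

Running it once on the bastion and once through jsub, then comparing the two
outputs, should show whether the locale differs between the two environments.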
If running the job with the same locale you use in your interactive
tests does not fix the problem, you may also be hitting a deeper
Python3 unicode issue related to surrogate codepoints
(<https://bugs.python.org/issue12892>). This is hinted at by the
"position 67: surrogates not allowed" error message.
I can actually reproduce your error message in an interactive python
session on tools-dev from a starting state of LANG=en_US.UTF-8:
$ python3
Python 3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\udcc5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 0: surrogates not allowed
>>>
Explicitly encoding using 'surrogateescape' does work:
>>> print('\udcc5'.encode('utf-8', 'surrogateescape'))
b'\xc5'
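The byte 0xC5 is probably not a coincidence: 'ř' (U+0159) from the -search
argument is encoded in UTF-8 as the two bytes C5 99, and Python 3 decodes
command-line arguments with the 'surrogateescape' handler (PEP 383). The
sketch below reproduces by hand what would happen if the arguments were
decoded as ASCII. That the grid node actually did so is an assumption; the
traceback is consistent with it but does not prove it.

```python
# 'ř' is U+0159; its UTF-8 encoding is the two bytes C5 99.
raw = "\u0159".encode("utf-8")

# Decoding those bytes as ASCII with 'surrogateescape' maps each
# undecodable byte 0xNN to the lone surrogate U+DCNN.
arg = raw.decode("ascii", "surrogateescape")

# A strict UTF-8 encode of such a string fails, as in the traceback.
try:
    arg.encode("utf-8")
except UnicodeEncodeError as exc:
    error_reason = exc.reason
else:
    error_reason = None

# Reversing the escape yields the original bytes again, which can then
# be decoded properly as UTF-8.
recovered = arg.encode("ascii", "surrogateescape").decode("utf-8")
```

Here arg is '\udcc5\udc99', the first character being the one named in the
error message, and recovered is 'ř' again.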
It looks like the error could be dealt with in pywikibot by patching
writeToCommandLogFile() to open the codec used for output with an
error handler other than the default errors='strict'
(<https://docs.python.org/3/library/codecs.html#error-handlers>).
$ python3
Python 3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\udcc5'.encode('utf-8', 'ignore'))
b''
>>> print('\udcc5'.encode('utf-8', 'replace'))
b'?'
>>> print('\udcc5'.encode('utf-8', 'xmlcharrefreplace'))
b'&#56517;'
>>> print('\udcc5'.encode('utf-8', 'backslashreplace'))
b'\\udcc5'
>>> print('\udcc5'.encode('utf-8', 'surrogateescape'))
b'\xc5'
>>> print('\udcc5'.encode('utf-8', 'surrogatepass'))
b'\xed\xb3\x85'
>>>
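As a rough idea of what such a patch might look like, here is a sketch that
writes a log line through a file opened with a non-strict error handler. This
is not pywikibot's actual code; append_log_line and demo are hypothetical
names, and the choice of 'backslashreplace' as the default is only an
assumption about what would be most readable in a log.

```python
import os
import tempfile


def append_log_line(path, text, errors="backslashreplace"):
    """Append text to a UTF-8 log file, using a non-strict error handler
    so that lone surrogates do not raise UnicodeEncodeError."""
    # newline="\n" disables newline translation so the bytes are predictable.
    with open(path, "a", encoding="utf-8", errors=errors, newline="\n") as log:
        log.write(text + "\n")


def demo(errors="backslashreplace"):
    """Write a line containing lone surrogates and return the raw bytes
    that ended up in the file."""
    line = "-search:" + "\udcc5\udc99" + "eklad"
    fd, path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    try:
        append_log_line(path, line, errors=errors)
        with open(path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(path)
```

With 'backslashreplace' the surrogates show up in the log as the literal text
\udcc5\udc99, while 'surrogateescape' writes back the original bytes C5 99,
which happen to be valid UTF-8 for 'ř' in this case.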
Bryan
--
Bryan Davis Wikimedia Foundation <bd808 at wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services Boise, ID USA
irc: bd808 v:415.839.6885 x6855