https://bugzilla.wikimedia.org/show_bug.cgi?id=58872
Web browser: ---
Bug ID: 58872
Summary: Set locale if system uses wrong default
Product: Pywikibot
Version: core (2.0)
Hardware: All
OS: All
Status: NEW
Severity: major
Priority: Unprioritized
Component: General
Assignee: Pywikipedia-bugs@lists.wikimedia.org
Reporter: dr.trigon@surfeu.ch
Classification: Unclassified
Mobile Platform: ---
The grid engine on Tool Labs has a different default locale setting than the console.
Grid engine:
import locale
print locale.localeconv()
{'mon_decimal_point': '', 'int_frac_digits': 127, 'p_sep_by_space': 127, 'frac_digits': 127, 'thousands_sep': '', 'n_sign_posn': 127, 'decimal_point': '.', 'int_curr_symbol': '', 'n_cs_precedes': 127, 'p_sign_posn': 127, 'mon_thousands_sep': '', 'negative_sign': '', 'currency_symbol': '', 'n_sep_by_space': 127, 'mon_grouping': [], 'p_cs_precedes': 127, 'positive_sign': '', 'grouping': []}
print locale.getdefaultlocale()
(None, None)
print locale.getlocale()
(None, None)
print locale.getpreferredencoding()
ANSI_X3.4-1968
Console:
import locale
print locale.localeconv()
{'mon_decimal_point': '', 'int_frac_digits': 127, 'p_sep_by_space': 127, 'frac_digits': 127, 'thousands_sep': '', 'n_sign_posn': 127, 'decimal_point': '.', 'int_curr_symbol': '', 'n_cs_precedes': 127, 'p_sign_posn': 127, 'mon_thousands_sep': '', 'negative_sign': '', 'currency_symbol': '', 'n_sep_by_space': 127, 'mon_grouping': [], 'p_cs_precedes': 127, 'positive_sign': '', 'grouping': []}
print locale.getdefaultlocale()
('en_US', 'UTF-8')
print locale.getlocale()
(None, None)
print locale.getpreferredencoding()
UTF-8
The console locale works with pywikibot; the grid engine one does not (see Bug 58181). Essentially, the issue is that the locale on the grid engine is not set properly. But it does not matter where this error comes from: the bots must not crash in such situations.
I propose checking 'locale.getdefaultlocale()' on startup and comparing it to 'config.textfile_encoding' (maybe also 'config.console_encoding'). If they do not match, the encoding should be set according to the config so that the correct one is used.
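The proposed startup check could be sketched roughly as follows. This is a Python 3 illustration, not pywikibot code: `CONFIGURED_ENCODING` stands in for 'config.textfile_encoding', and the helper names are hypothetical. It reads `locale.getpreferredencoding()` (the call that reports ANSI_X3.4-1968 on the grid engine above) and compares canonical codec names, so that spellings such as 'UTF8' and 'UTF-8' count as equal.

```python
import codecs
import locale

# Hypothetical stand-in for config.textfile_encoding from the report.
CONFIGURED_ENCODING = "utf-8"


def normalized(name):
    """Return Python's canonical codec name, or None if the name is unknown."""
    try:
        return codecs.lookup(name).name
    except (LookupError, TypeError):
        return None


def encoding_mismatch(system_encoding, configured_encoding=CONFIGURED_ENCODING):
    """Return True if the system encoding is unknown or differs from the config."""
    system_name = normalized(system_encoding)
    return system_name is None or system_name != normalized(configured_encoding)


def effective_encoding(configured_encoding=CONFIGURED_ENCODING):
    """Pick the encoding to use: the configured one wins on any mismatch."""
    system_encoding = locale.getpreferredencoding(False)
    if encoding_mismatch(system_encoding, configured_encoding):
        return configured_encoding
    return system_encoding
```

With the values from the report, the grid engine's ANSI_X3.4-1968 resolves to the ASCII codec and is flagged as a mismatch against UTF-8, while the console's UTF-8 is not.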
DrTrigon dr.trigon@surfeu.ch changed:
What      |Removed   |Added
----------------------------------------------------------------------------
Blocks    |          |58181
Merlijn van Deen valhallasw@arctus.nl changed:
What      |Removed   |Added
----------------------------------------------------------------------------
Status    |NEW       |RESOLVED
Resolution|---       |INVALID

--- Comment #1 from Merlijn van Deen valhallasw@arctus.nl ---
This is not a pywikibot issue, but an issue with your code - as I have explained before. Filenames should *never* be unicode strings -- always byte strings. It's just luck (or rather: a combination of factors that happens to be just right) that it works in the shell.
Marcin Cieślak marcin.cieslak@gmail.com changed:
What      |Removed   |Added
----------------------------------------------------------------------------
CC        |          |marcin.cieslak@gmail.com

--- Comment #2 from Marcin Cieślak marcin.cieslak@gmail.com ---
In my case:
locale.getdefaultlocale()
('pl_PL', 'UTF8')
locale.getpreferredencoding()
'UTF-8'
Why is en_US better?
If I create files automatically, for example from article names, I prefer to .encode("utf-8") the unicode strings manually without resorting to the locale module.
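A minimal sketch of that approach, assuming Python 3 (where the thread's "unicode string" is `str` and "byte string" is `bytes`). The function name and the ".txt" suffix are hypothetical; the point is only that the file name is encoded explicitly, so the result does not depend on the process locale.

```python
import os


def filename_for_title(title, directory):
    """Build a byte-string path from an article title via explicit UTF-8 encoding."""
    # Replace the path separator so the title stays a single path component.
    safe_title = title.replace("/", "_")
    # Encode by hand instead of relying on the locale's preferred encoding.
    name = safe_title.encode("utf-8") + b".txt"
    return os.path.join(os.fsencode(directory), name)
```

The returned path is a byte string and can be passed directly to `open()`.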
DrTrigon dr.trigon@surfeu.ch changed:
What      |Removed     |Added
----------------------------------------------------------------------------
Version   |core (2.0)  |compat (1.0)
--- Comment #3 from DrTrigon dr.trigon@surfeu.ch ---
(In reply to comment #1)
> This is not a pywikibot issue, but an issue with your code - as I have explained before. Filenames should *never* be unicode strings -- always byte strings. It's just luck (or rather: a combination of factors that happens to be just right) that it works in the shell.
As explained, I am always very confused by this unicode vs. bytestring stuff - I know the details - I am just mixing it up all the time... so please be patient with me.
I did correct all those errors and issues within my scripts once and thus was not aware (and was even more confused) that there are still bugs.
Since you give the impression of being "the expert" on such string issues, I desperately need your help and may well need it again in the future.
I was e.g. enormously confused by the fact that unicode strings also need an internal representation in Python. I always assumed this had to be UTF (8, 16 or 32), and thus I was mixing up UTF and unicode conceptually. Now I have learned about UCS [1] and should have it sorted out:
- (byte)string (ASCII, UTF or else)
- unicode (internally UCS)

encode: unicode -> bytestring
decode: bytestring -> unicode
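The two directions can be illustrated with a short round trip. This is a sketch in Python 3, where the "unicode" type is `str` and the "bytestring" type is `bytes`; the helper names are hypothetical.

```python
def round_trip(text, codec="utf-8"):
    """Encode text to bytes with the given codec, then decode it back."""
    data = text.encode(codec)      # encode: unicode -> bytestring
    return data, data.decode(codec)  # decode: bytestring -> unicode


def decodes_as(data, codec):
    """Return True if the byte string is valid in the given codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False
```

The same text yields different byte strings under different codecs (for example UTF-8 and UTF-16), and a UTF-8 byte string containing non-ASCII characters cannot be decoded as ASCII. That is the kind of failure an ASCII default encoding produces.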
[1] http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
Please correct me if I said something wrong (again ;) ...
Btw.: using a 'UTF' locale on the Tool Labs grid engine would not be the correct solution, but it would have helped and would not do any harm anyway, so I don't see the problem with that... but this is not an issue anymore... ;)
Greetings