2010/11/25 Maciej Jaros egil@wp.pl:
@2010-11-26 03:33, MZMcBride:
Maciej Jaros wrote:
There seems to be a problem with character encoding in (at least) the Polish Wikipedia database. At first I thought it was a problem with my script, but phpMyAdmin and even the shell mysql client also show weird characters.
See for example: SELECT page_title FROM page WHERE page_id = 2117937
Looks fine to me:
mzmcbride@nightshade:~$ mysql -hsql-s2 -e 'SELECT page_title FROM page WHERE page_id = 2117937;' plwiki_p;
+---------------------+
| page_title          |
+---------------------+
| Vladimír_Železný    |
+---------------------+
Or a lot more here: http://toolserver.org/~eccenux/dna/index.php?D=2010-10-10
The database is set to use latin-1 encoding / collation, but the text of page titles is stored in the database as byte strings. In this specific case, it looks like your tool is mishandling the data.
In general, you want to make sure that the web server is outputting "Content-Type: text/html;charset=utf-8" in its headers. You also want to make sure that your browser is set to use UTF-8 encoding when viewing pages (which it will usually properly auto-detect if the headers are correct) and that the tool you've written properly encodes the byte strings as UTF-8.
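The advice above can be sketched in Python. This is only an illustration under stated assumptions, not the code of the tool in question: `raw_title` stands in for the byte string a MySQL client would return for `page_title`, and `render_page` is a hypothetical helper name.

```python
import html

# Hypothetical raw value: the UTF-8 bytes of the page title, as a client
# would receive them from a column that stores byte strings.
raw_title = b"Vladim\xc3\xadr_\xc5\xbdelezn\xc3\xbd"


def render_page(raw):
    """Return a (header, body) pair for a minimal HTML response."""
    # Interpret the stored bytes as UTF-8, not as latin-1 or a local code page.
    title = raw.decode("utf-8")
    # Tell the browser which encoding the body uses.
    header = "Content-Type: text/html;charset=utf-8"
    # Encode the body with the same encoding that the header announces.
    body = "<p>{}</p>".format(html.escape(title.replace("_", " "))).encode("utf-8")
    return header, body


header, body = render_page(raw_title)
```

The point of the sketch is that the decode step, the header, and the final encode step all name the same encoding; a mismatch in any one of them produces garbled characters.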
When it's a choice between the database being corrupt and user error, the odds favor user error. ;-)
Database corruption can be a user error too ;-).
I'm seeing the same result in three tools, so this is not just something in my script.
PuTTY on Windows 7 gives me:
eccenux@nightshade:~$ mysql -hsql-s2 -e 'SELECT page_title FROM page WHERE page_id = 2117937;' plwiki_p;
+---------------------+
| page_title          |
+---------------------+
| VladimĂr_Ĺ˝eleznĂ˝  |
+---------------------+
I get the same with my script and with phpMyAdmin, which is running with default settings. Is this something in my profile settings, or what?
Best, Nux.
Have you set PuTTY or your web browser (when viewing phpMyAdmin) to use UTF-8?
-Jeremy
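The garbled title in the PuTTY output is consistent with correct UTF-8 bytes being displayed through a single-byte Central European code page. A minimal sketch, assuming the code page is cp1250 (a guess; ISO-8859-2 maps these particular bytes to the same characters), with hypothetical helper names `misread` and `repair`:

```python
def misread(text, wrong_encoding="cp1250"):
    """Show what UTF-8 text looks like when its bytes are read as another encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled, wrong_encoding="cp1250"):
    """Reverse the misreading: recover the original bytes, then decode as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


original = "Vladimír_Železný"
garbled = misread(original)

# The byte 0xAD from "í" becomes an invisible soft hyphen (U+00AD) in cp1250,
# so strip it before comparing with what a terminal shows.
visible = garbled.replace("\u00ad", "")
print(visible)
print(repair(garbled))
```

If the round trip through `repair` gives back the original title, the stored data is intact and only the display encoding of the terminal or browser needs changing.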