Maciej Jaros wrote:
There seems to be a problem with character encoding in (at least) the Polish
Wikipedia database. At first I thought it was a problem with my script, but
phpMyAdmin and even shell mysql also show weird characters.
See for example:
SELECT page_title FROM page WHERE page_id = 2117937
Looks fine to me:
mzmcbride@nightshade:~$ mysql -hsql-s2 -e 'SELECT page_title FROM page WHERE page_id = 2117937;' plwiki_p;
+---------------------+
| page_title          |
+---------------------+
| Vladimír_Železný |
+---------------------+
The database is set to use the latin-1 encoding and collation, but page
titles are stored as byte strings (UTF-8 bytes, as the output above shows),
so the latin-1 label says nothing about how the text is actually encoded. In
this specific case, it looks like your tool is mishandling the data.
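
To illustrate the kind of mishandling I mean, here is a minimal sketch in
Python (chosen only for illustration; I don't know what your tool is written
in). It assumes the bytes coming back from the database are the UTF-8 bytes
of the title above, and shows what happens when they are read as latin-1
instead:

```python
# Illustration only: the title is the one returned by the query above.
title = "Vladimír_Železný"

# Assumption: the database hands back the raw UTF-8 bytes of the title.
raw = title.encode("utf-8")

# Reading those bytes as latin-1 turns each non-ASCII character into two
# unrelated characters, which is the typical "weird characters" symptom.
garbled = raw.decode("latin-1")

# Reading the same bytes as UTF-8 recovers the original title.
correct = raw.decode("utf-8")

print(len(raw), len(garbled), len(correct))  # 19 19 16
print(correct == title)                      # True
print(garbled == title)                      # False
```

The data is identical in both cases; only the interpretation differs.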
In general, you want to make sure that the web server is outputting
"Content-Type: text/html; charset=utf-8" in its headers. You also want to
make sure that your browser is set to use UTF-8 encoding when viewing pages
(which it will usually auto-detect properly if the headers are correct) and
that the tool you've written treats the byte strings as UTF-8 rather than
latin-1.
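
As a rough sketch of those two pieces together (again in Python, purely for
illustration), something like the following would do. The helper name
render_title_page is hypothetical, and how the headers actually get sent
depends on your web server or framework:

```python
import html


def render_title_page(raw_title: bytes):
    """Hypothetical helper: build (headers, body) for a page showing a title."""
    # Interpret the database bytes as UTF-8; this raises UnicodeDecodeError
    # if the bytes are not valid UTF-8, instead of silently garbling them.
    title = raw_title.decode("utf-8")

    # Escape the title for HTML, then encode the whole page as UTF-8.
    page = "<html><body><p>{}</p></body></html>".format(html.escape(title))
    body = page.encode("utf-8")

    # Declare the same encoding in the header so the browser agrees.
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    return headers, body


headers, body = render_title_page("Vladimír_Železný".encode("utf-8"))
```

The point is simply that the bytes are decoded once, as UTF-8, and that the
header and the encoded body name the same charset.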
When it's a choice between the database being corrupt and user error, the
odds favor user error. ;-)
MZMcBride