Hi.
There seems to be a problem with character encoding in (at least) the Polish Wikipedia database. At first I thought it was a problem with my script, but phpMyAdmin and even the mysql shell client also show weird characters.
See for example: SELECT page_title FROM page WHERE page_id = 2117937
Or a lot more here: http://toolserver.org/~eccenux/dna/index.php?D=2010-10-10
Regards, Nux.
Maciej Jaros wrote:
See for example: SELECT page_title FROM page WHERE page_id = 2117937
Looks fine to me:
mzmcbride@nightshade:~$ mysql -hsql-s2 -e 'SELECT page_title FROM page WHERE page_id = 2117937;' plwiki_p;
+---------------------+
| page_title          |
+---------------------+
| Vladimír_Železný |
+---------------------+
Or a lot more here: http://toolserver.org/~eccenux/dna/index.php?D=2010-10-10
The database is set to use latin-1 encoding / collation, but the text of page titles is stored in the database as byte strings. In this specific case, it looks like your tool is mishandling the data.
In general, you want to make sure that the web server is outputting "Content-Type: text/html;charset=utf-8" in its headers. You also want to make sure that your browser is set to use UTF-8 encoding when viewing pages (which it will usually properly auto-detect if the headers are correct) and that the tool you've written properly encodes the byte strings as UTF-8.
When it's a choice between the database being corrupt and user error, the odds favor user error. ;-)
MZMcBride
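The advice above (declare UTF-8 in the response headers and treat the database bytes as UTF-8) can be sketched roughly as follows. This is only an illustration, not the tool discussed in the thread: the `app` function and the hard-coded `raw_title` bytes are assumptions standing in for a real script and a real query result.

```python
import html


def app(environ, start_response):
    """Minimal WSGI-style handler that declares and emits UTF-8."""
    # Hypothetical query result: the page title as raw UTF-8 bytes,
    # which is how the thread says titles are stored.
    raw_title = b"Vladim\xc3\xadr_\xc5\xbdelezn\xc3\xbd"

    # Interpret the bytes as UTF-8 exactly once.
    title = raw_title.decode("utf-8")

    # Encode the page as UTF-8 and say so in the header, so the
    # browser does not have to guess the charset.
    body = f"<p>{html.escape(title)}</p>".encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

The key point is that the bytes are decoded once and the declared charset matches the encoding actually used for the body.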
@2010-11-26 03:33, MZMcBride:
When it's a choice between the database being corrupt and user error, the odds favor user error. ;-)
Database corruption can be a user error too ;-).
I'm seeing the same result in three tools, so this is not just something in my script.
PuTTY on Windows 7 gives me:
eccenux@nightshade:~$ mysql -hsql-s2 -e 'SELECT page_title FROM page WHERE page_id = 2117937;' plwiki_p;
+---------------------+
| page_title          |
+---------------------+
| VladimĂr_Ĺ˝eleznĂ˝ |
+---------------------+
And the same with my script and with phpMyAdmin, which is running with default settings... Is this something with my profile settings or what?
Best, Nux.
2010/11/25 Maciej Jaros egil@wp.pl:
And the same with my script and with phpMyAdmin, which is running with default settings... Is this something with my profile settings or what?
Have you set putty or your web browser (when viewing phpMyAdmin) to use UTF-8?
-Jeremy
@2010-11-26 04:19, Jeremy Baron:
Have you set putty or your web browser (when viewing phpMyAdmin) to use UTF-8?
Setting up PuTTY to translate from UTF-8 worked, but I haven't managed to figure out how to set up phpMyAdmin on the toolserver. I've tried various connection collations and nothing worked (besides explicit casting in the select statement).
Best, Nux.
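The garbled PuTTY output earlier in the thread is consistent with correct UTF-8 bytes being displayed in a Central European single-byte code page. The sketch below reproduces that effect. It assumes Windows-1250 as the terminal charset (the thread does not say which translation PuTTY was using), and the helper names `misread` and `repair` are made up for illustration.

```python
ORIGINAL = "Vladimír_Železný"


def misread(text: str, terminal_charset: str = "cp1250") -> str:
    """Show UTF-8 bytes the way a terminal using another charset would."""
    return text.encode("utf-8").decode(terminal_charset)


def repair(garbled: str, terminal_charset: str = "cp1250") -> str:
    """Undo the misreading: recover the bytes, then decode them as UTF-8."""
    return garbled.encode(terminal_charset).decode("utf-8")


garbled = misread(ORIGINAL)
print(repr(garbled))    # repr makes the invisible soft hyphen visible
print(repair(garbled))  # round-trips back to the original title
```

Nothing in the database changes in this scenario. Only the interpretation of the bytes on the client side differs, which is why switching the terminal to UTF-8 was enough.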
@2010-11-26 04:00, Maciej Jaros:
And the same with my script and with phpMyAdmin, which is running with default settings... Is this something with my profile settings or what?
Strange. I wasn't able to make phpMyAdmin behave as expected other than by casting page_title as binary. The shell behaves the same for me, but I guess that might be because my system doesn't use latin1.
I guess using this in my script was NOT a good idea: PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8" :-).
Hope it helps someone else.
Best, Nux.
Maciej Jaros wrote:
I guess using this in my script was NOT a good idea: PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8" :-).
It is not a good idea for WMF tables. That's like setting $wgDBmysql5 = true; in MediaWiki, which WMF doesn't do.
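Why "SET NAMES utf8" can hurt here may be easier to see with a small model. The sketch below does not talk to MySQL. It only imitates, under stated assumptions, what a server would do if the column is labelled latin1 but actually holds UTF-8 bytes: with a utf8 connection charset the server converts every stored byte from latin1 to UTF-8, so the text gets encoded twice. The function names are hypothetical, and Python's "latin-1" codec stands in for MySQL's latin1.

```python
ORIGINAL = "Vladimír_Železný"
STORED = ORIGINAL.encode("utf-8")  # UTF-8 bytes sitting in a latin1-labelled column


def transcode_latin1_to_utf8(stored: bytes) -> bytes:
    """Model of the server converting 'latin1' column data for a utf8 connection."""
    return stored.decode("latin-1").encode("utf-8")


def fetch(stored: bytes, names_utf8: bool) -> bytes:
    """Bytes the client receives, with or without SET NAMES utf8."""
    # With a latin1 connection, latin1 column data is passed through unchanged.
    return transcode_latin1_to_utf8(stored) if names_utf8 else stored


plain = fetch(STORED, names_utf8=False).decode("utf-8")
double = fetch(STORED, names_utf8=True).decode("utf-8")

print(plain)         # the title as intended
print(repr(double))  # double-encoded text

# Reversing the extra conversion recovers the stored bytes.
assert double.encode("latin-1") == STORED
```

Under these assumptions, leaving the connection charset alone (or casting the column to binary, as mentioned earlier in the thread) avoids the extra conversion.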
toolserver-l@lists.wikimedia.org