Hello,
When trying to use mwdumper to import enwiki-20061001-pages-articles.xml.bz2, I find that using --output=mysql:... --format=sql:1.5 causes some piece of the pipeline to eat UTF-8. I can verify this in the database: the text table will have data like is:Stj�rnleysisstefna where the middle mojibake character is a single byte.
If I use mwdumper with --output=stdout, I can verify the resulting SQL has good UTF-8 in it. If I run a command like mwdumper --output=stdout --format=sql:1.5 ... | mysql -uwiki -pwiki wikidb things come out ok: the Greek in [[Anarchism]] (which is a useful test article because it occurs early in the dump) displays fine.
Some other info that may be useful: - I have $wgDBmysql5 = false; in my LocalSettings.php. - My locale doesn't seem to affect it, but it's all en_US.UTF-8 in case that matters. - java version "1.5.0_07" / Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-b03) - Command line I'm trying is: java -server -classpath /usr/share/java/mysql-3.1.11.jar:mwdumper/bin org.mediawiki.dumper.Dumper "--output=mysql://localhost/wikidb?user=wiki&password=wiki" --format=sql:1.5 enwiki-20061001-pages-articles.xml.bz2
That mysql jarfile comes from libmysql-java on Ubuntu dapper, version 3.1.11-1, but I find the same behavior with mysql-connector-java-5.0.4.
I could file this as a bug, but I wanted to first verify I wasn't doing anything wrong.
Evan Martin wrote:
When trying to use mwdumper to import enwiki-20061001-pages-articles.xml.bz2, I find that using --output=mysql:... --format=sql:1.5 causes some piece of the pipeline to eat UTF-8. I can verify this in the database: the text table will have data like is:Stj�rnleysisstefna where the middle mojibake character is a single byte.
[snip]
Some other info that may be useful:
- I have $wgDBmysql5 = false; in my LocalSettings.php.
Assuming you're using MySQL 4.1 or higher, this is your problem.
When you let Java speak directly to MySQL, it's going to try to speak "real" UTF-8.
However MediaWiki, when $wgDBmysql5 is off, is charset-agnostic to MySQL and just grabs and sends UTF-8 strings without caring what MySQL thinks they are. That may, or may not, be the same as what MySQL thinks is UTF-8 internally.
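This isn't MediaWiki or mwdumper code, but a small Python sketch of the byte-level effect being described: a driver that declares its connection as UTF-8 writing into a column the server believes is latin1 (the example string is the Icelandic title from the original report):

```python
# What the JDBC driver sends: real UTF-8 bytes, declared as UTF-8.
s = "Stjórnleysisstefna"          # Icelandic for "anarchism"
wire = s.encode("utf-8")          # 19 bytes: 'ó' takes two bytes

# The server, believing the column is latin1, converts UTF-8 -> latin1.
stored = wire.decode("utf-8").encode("latin-1")
print(len(wire), len(stored))     # 19 18 -- 'ó' collapsed to one latin1 byte (0xF3)

# MediaWiki later reads the raw bytes back and treats them as UTF-8;
# the lone 0xF3 is not valid UTF-8, hence the replacement character.
print(stored.decode("utf-8", errors="replace"))  # Stj�rnleysisstefna
```

Which is exactly the symptom in the original mail: a single-byte mojibake character in the middle of the title.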
To tell MediaWiki to declare its connection communication as UTF-8, set $wgDBmysql5 to true. (Make sure the tables actually are defined as UTF-8 if you do this, for instance by using the mysql5/tables.sql schema.)
Unfortunately MySQL 4.1/5.0/5.1 doesn't support UTF-8 fully (its utf8 charset only covers the Basic Multilingual Plane, i.e. sequences up to three bytes), so you may end up with some corrupted pages and who knows what other problems if you use the native UTF-8 schema. For instance you may get a number of invalid page titles, which could result in unique key conflicts.
Thanks, MySQL!
For this reason I recommend sticking with MySQL 4.0 if you want to work with Wikimedia data...
If I use mwdumper with --output=stdout, I can verify the resulting SQL has good UTF-8 in it. If I run a command like mwdumper --output=stdout --format=sql:1.5 ... | mysql -uwiki -pwiki wikidb things come out ok: the Greek in [[Anarchism]] (which is a useful test article because it occurs early in the dump) displays fine.
You could also continue doing this, which should probably behave in the same default way as MediaWiki with $wgDBmysql5 set to off.
-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
Hi,
However MediaWiki, when $wgDBmysql5 is off, is charset-agnostic to MySQL and just grabs and sends UTF-8 strings without caring what MySQL thinks they are. That may, or may not, be the same as what MySQL thinks is UTF-8 internally.
The current problem is that the schema specifies the latin1 character set where it should be using 'varbinary'/'blob'. The JDBC standard requires associating a proper character set with anything that is text.
No charset conversion should happen for binary fields.
Thanks, MySQL!
:-)
For this reason I recommend sticking with MySQL 4.0 if you want to work with Wikimedia data...
Or actually update the schema not to use charset-enabled types anywhere.
Cheers,
On 12/4/06, Brion Vibber brion@pobox.com wrote:
Evan Martin wrote:
When trying to use mwdumper to import enwiki-20061001-pages-articles.xml.bz2, I find that using --output=mysql:... --format=sql:1.5 causes some piece of the pipeline to eat UTF-8. I can verify this in the database: the text table will have data like is:Stj�rnleysisstefna where the middle mojibake character is a single byte.
[snip]
Some other info that may be useful:
- I have $wgDBmysql5 = false; in my LocalSettings.php.
Assuming you're using MySQL 4.1 or higher, this is your problem.
When you let Java speak directly to MySQL, it's going to try to speak "real" UTF-8.
Aha! I'm vaguely familiar with the UTF-8 problem with MySQL, but I couldn't see how it was getting involved here. After sending my mail I had thought $wgDBmysql5 was a red herring, because mwdumper (I assume) doesn't use it, but I realize now that that value is used when the tables are *created*.
To rephrase for anyone else who stumbles across this thread: $wgDBmysql5 = false means that the tables are created with DEFAULT CHARSET=latin1, which isn't especially a problem as long as the software atop it (mediawiki) knows that it's actually storing UTF-8. But when you use the Java library to speak to MySQL, it notices that the table is marked as latin1 and tries to convert your UTF-8 data for you while importing.
I now understand importing via a pipe to MySQL is my best bet.
Thanks for the quick response!
Hi!
To rephrase for anyone else who stumbles across this thread: $wgDBmysql5 = false means that the tables are created with DEFAULT CHARSET=latin1, which isn't especially a problem as long as the software atop it (mediawiki) knows that it's actually storing UTF-8.
I wonder if DEFAULT CHARSET=binary would help here.
But when you use the Java library to speak to MySQL, it notices that the table is marked as latin1 and tries to convert your UTF-8 data for you while importing.
You can hack around that by specifying characterEncoding=UTF-8 in the JDBC parameters, then executing 'SET NAMES latin1', to avoid any conversions ;-) Oh well, there might be some escaping issues, but let's assume they don't exist.
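For anyone puzzled by why that hack works, here is a Python sketch (purely illustrative, no MySQL involved) of the byte-level consequence of 'SET NAMES latin1': when the server believes both the connection and the column are latin1, the "conversion" degenerates into a byte-for-byte no-op, so the UTF-8 bytes land in the table untouched:

```python
wire = "Stjórnleysisstefna".encode("utf-8")  # real UTF-8 bytes on the wire

# With 'SET NAMES latin1' the server thinks connection charset == column
# charset, so latin1 -> latin1 "conversion" copies every byte verbatim.
stored = wire.decode("latin-1").encode("latin-1")
print(stored == wire)                        # True -- nothing was touched

# MediaWiki reads the raw bytes back and interprets them as UTF-8.
print(stored.decode("utf-8"))                # Stjórnleysisstefna
```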
Domas Mituzas wrote:
To rephrase for anyone else who stumbles across this thread: $wgDBmysql5 = false means that the tables are created with DEFAULT CHARSET=latin1, which isn't especially a problem as long as the software atop it (mediawiki) knows that it's actually storing UTF-8.
I wonder if DEFAULT CHARSET=binary would help here.
Please test the binary schema (mysql5/tables-binary.sql); I think there are some problems.
Behavior with the binary schema appears to be a bit different; CHAR fields get padded with nulls instead of spaces, and the nulls don't get removed on SELECT.
I seem to recall this borked up at least the objectcache table, and it might affect others.
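A quick Python illustration of the padding difference (not MySQL itself, just mimicking the bytes an application would see from a CHAR column under each schema; the column width and key value are made up):

```python
# A CHAR(8) column pads values to full width. With a textual charset the
# pad is spaces, which MySQL strips on SELECT; with the binary schema the
# pad is NUL bytes, which come back as part of the value.
text_stored   = b"mykey   "           # textual CHAR: space-padded, trimmed on read
binary_stored = b"mykey\x00\x00\x00"  # binary CHAR: NUL-padded, NOT trimmed

print(text_stored.rstrip(b" ") == b"mykey")        # True
print(binary_stored == b"mykey")                   # False -- trailing NULs survive
print(binary_stored.rstrip(b"\x00") == b"mykey")   # True, but the app must strip
```

Anything that compares or hashes the value as read back (the objectcache keys Brion mentions, for instance) would see "mykey\0\0\0" != "mykey".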
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org