Hello,
When I use mwdumper to import enwiki-20061001-pages-articles.xml.bz2 with --output=mysql:... --format=sql:1.5, something along the way mangles the UTF-8. I can verify this in the database: the text table contains data like is:Stj�rnleysisstefna, where the mojibake character in the middle is a single byte.
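To illustrate why a single byte is suspicious, here is a minimal sketch. It assumes the mangled character is ó (U+00F3), which takes two bytes in UTF-8, and it guesses that something converts the text to Latin-1 (ISO-8859-1) on the way into the database. Both are assumptions on my part, not something I have confirmed.

```shell
# Assumption: the mangled character is ó (U+00F3), i.e. bytes C3 B3 in UTF-8.
title_utf8=$(printf 'Stj\303\263rnleysisstefna')

# Byte count of the title as valid UTF-8.
utf8_len=$(printf '%s' "$title_utf8" | wc -c | tr -d ' ')

# Byte count after a hypothetical conversion to Latin-1,
# where ó would become the single byte F3.
latin1_len=$(printf '%s' "$title_utf8" | iconv -f UTF-8 -t ISO-8859-1 | wc -c | tr -d ' ')

echo "UTF-8 bytes:   $utf8_len"
echo "Latin-1 bytes: $latin1_len"
```

If the guess is right, the Latin-1 form is one byte shorter than the UTF-8 form, and a UTF-8 terminal would show that lone byte as a replacement character, which matches what I see in the text table.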
If I instead run mwdumper with --output=stdout, I can verify that the resulting SQL contains good UTF-8. And if I pipe that output into the mysql client with a command like mwdumper --output=stdout --format=sql:1.5 ... | mysql -uwiki -pwiki wikidb, things come out OK: the Greek in [[Anarchism]] (a useful test article because it occurs early in the dump) displays fine.
Some other info that may be useful:
- I have $wgDBmysql5 = false; in my LocalSettings.php.
- My locale doesn't seem to affect it, but it's all en_US.UTF-8 in case that matters.
- java version "1.5.0_07" / Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-b03)
- The command line I'm trying is:
  java -server -classpath /usr/share/java/mysql-3.1.11.jar:mwdumper/bin org.mediawiki.dumper.Dumper "--output=mysql://localhost/wikidb?user=wiki&password=wiki" --format=sql:1.5 enwiki-20061001-pages-articles.xml.bz2
That mysql jarfile comes from libmysql-java on Ubuntu dapper, version 3.1.11-1, but I find the same behavior with mysql-connector-java-5.0.4.
I could file this as a bug, but I wanted to first verify I wasn't doing anything wrong.
wikitech-l@lists.wikimedia.org