-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Evan Martin wrote:
When trying to use mwdumper to import
enwiki-20061001-pages-articles.xml.bz2, I find that using the
--output=mysql:... --format=sql:1.5 causes some piece to eat UTF-8. I
can verify this in the database: the text table will have data like
is:Stj�rnleysisstefna where the middle mojibake character is a single
byte.
[snip]
Some other info that may be useful:
- I have $wgDBmysql5 = false; in my LocalSettings.php.
Assuming you're using MySQL 4.1 or higher, this is your problem.
When you let Java speak directly to MySQL, it's going to try to speak
"real" UTF-8.
However, MediaWiki, when $wgDBmysql5 is off, is charset-agnostic to MySQL
and just grabs and sends UTF-8 strings without caring what MySQL thinks
they are. That may, or may not, be the same as what MySQL thinks is
UTF-8 internally.
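As a rough illustration of the mismatch, here is a Python sketch of one
way a single-byte character can end up in the text table. It assumes the
column is declared latin1 (nothing above says which charset the tables
actually use, so treat that as a guess) and that MySQL converts whatever
arrives over a connection declared as UTF-8 into the column's charset:

```python
# Hypothetical illustration only; the latin1 column charset is an assumption.
title = "Stjórnleysisstefna"

# What MediaWiki with $wgDBmysql5 = false sends and expects back:
# the raw UTF-8 bytes, untouched by MySQL.
raw_utf8 = title.encode("utf-8")

# What a latin1 column would hold if MySQL were told the incoming data
# is UTF-8 and transcoded it on the way in.
transcoded = title.encode("latin-1")

# MediaWiki later reads the stored bytes as if they were UTF-8.
as_read_ok = raw_utf8.decode("utf-8")
as_read_bad = transcoded.decode("utf-8", errors="replace")

# 'ó' takes two bytes in UTF-8 but only one in latin1.
print(len(raw_utf8) - len(transcoded))
print(as_read_bad)
```

Under those assumptions the transcoded copy is one byte shorter, and that
lone byte is not valid UTF-8 when read back, which would match the
symptom described.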
To tell MediaWiki to declare its connection communication as UTF-8, set
$wgDBmysql5 to true. (Make sure the tables actually are defined as UTF-8
if you do this, for instance by using the mysql5/tables.sql schema.)
Unfortunately the utf8 character set in MySQL 4.1/5.0/5.1 doesn't
support UTF-8 fully: it stores at most three bytes per character, so
anything outside the Basic Multilingual Plane is lost. You may end up
with some corrupted pages and who knows what other problems if you use
the native UTF-8 schema. For instance you may get a number of invalid
page titles, which could result in unique key conflicts.
Thanks, MySQL!
For this reason I recommend sticking with MySQL 4.0 if you want to work
with Wikimedia data...
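To make the page title problem concrete, here is a small Python sketch.
It assumes that MySQL's utf8 charset in these versions holds at most
three bytes per character, and that a value is cut off at the first
character needing four bytes (that truncation behaviour is my
assumption about non-strict mode; check it against your server). The
function names are made up for the example:

```python
def first_non_bmp(text):
    """Return the index of the first character outside the BMP, or None."""
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:  # needs four bytes in UTF-8
            return i
    return None


def simulate_truncation(text):
    """Mimic the assumed behaviour: drop everything from the first 4-byte char."""
    i = first_non_bmp(text)
    return text if i is None else text[:i]


# Two distinct titles made of Gothic letters (outside the BMP), plus a plain one.
titles = ["\U00010330", "\U00010331", "Anarchism"]
stored = [simulate_truncation(t) for t in titles]

# Distinct titles that collapse to the same stored value would collide
# on a unique key.
collisions = len(stored) - len(set(stored))
print(stored, collisions)
```

Running something like first_non_bmp over the titles in a dump before
importing would at least tell you which pages are at risk.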
If I use mwdumper with --output=stdout, I can verify the resulting SQL
has good UTF-8 in it. If I run a command like
mwdumper --output=stdout --format=sql:1.5 ... | mysql -uwiki -pwiki wikidb
things come out ok: the Greek in [[Anarchism]] (which is a useful test
article because it occurs early in the dump) displays fine.
You could also continue doing this, which should probably behave in the
same default way as MediaWiki with $wgDBmysql5 set to off.
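If you want to check the piped SQL mechanically rather than by eye, a
sketch along these lines should do. It is not part of mwdumper; the
function name and chunk size are arbitrary choices for the example. It
reads a byte stream in chunks and reports whether the whole thing
decodes as UTF-8:

```python
import codecs
import io


def is_valid_utf8(stream, chunk_size=65536):
    """Return True if the binary stream decodes cleanly as UTF-8."""
    # An incremental decoder copes with multi-byte characters that
    # happen to be split across chunk boundaries.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            decoder.decode(chunk)
        # final=True raises if the stream ended mid-character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


# Small self-check on in-memory data.
print(is_valid_utf8(io.BytesIO("Αναρχισμός".encode("utf-8"))))
print(is_valid_utf8(io.BytesIO(b"Stj\xf3rnleysisstefna")))
```

To use it on the real output you would pass sys.stdin.buffer and pipe
the mwdumper --output=stdout output into the script.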
- -- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFFdP9FwRnhpk1wk44RAs/oAKC0zgBfmC8jIceb54KmjT4Y6ls5DACfaIuq
dxc3z42azaGff0GQiKANVWY=
=0r7G
-----END PGP SIGNATURE-----