[Mediawiki-l] Sudden problem with some greek and cyrillic letters

Sun May 6 10:37:39 UTC 2007

> * Due to the limitations of MySQL's Unicode support, by default we
> continue to treat MySQL fields as binary and store pure UTF-8 Unicode in
> them, although MySQL may have them listed as Latin-1 depending on your
> server's defaults.

Surely this is a bug?  If MW wants binary fields, then surely it should explicitly create them as binary, instead of leaving it up to some random server default?

Ian

 -----Original Message-----
From: 	Brion Vibber [mailto:brion at wikimedia.org]
Sent:	Friday, May 04, 2007 08:18 AM Pacific Standard Time
To:	MediaWiki announcements and site admin list
Subject:	Re: [Mediawiki-l] Sudden problem with some greek and cyrillic letters

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sylvain Machefert wrote:
> Hi Brion,
> what is strange, is that only the titles are affected, not the content of
> the pages. Is that normal ?

Yes -- the page text is in a binary BLOB field, which will not undergo
the bogus lossy conversion.

To summarize, the problem is roughly:

* MediaWiki assumes that MySQL will preserve data that is put into it
* MySQL sometimes corrupts the data

In a little more detail:

* Due to the limitations of MySQL's Unicode support, by default we
continue to treat MySQL fields as binary and store pure UTF-8 Unicode in
them, although MySQL may have them listed as Latin-1 depending on your
server's defaults.

* The mysqldump backup program by default in 4.1 and later applies a
conversion of non-binary fields to UTF-8, with a marker to have them
appropriately converted back when read in.

* This conversion is lossy -- it treats Latin-1 as the Windows-1252 code
page, which is an extension of ISO 8859-1 with additional characters in
the 128-159 range, which in ISO 8859-1 and Unicode is reserved for
non-printing control characters. Four of the code points in this range
are not assigned in Windows-1252, and so cannot be converted to UTF-8
Unicode -- these characters are silently corrupted into "?" characters
during the conversion if they appear.

* The UTF-8 encoding of Unicode uses the byte values which correspond to
those four non-convertible characters, as trailing bytes of multi-byte
sequences.

* As a result, UTF-8 text in a Latin-1 field may be corrupted, as some
characters are destroyed in the conversion back and forth.
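
The round trip described in the list above can be imitated in a few lines
of Python. This is only a sketch, not what mysqldump actually runs: it
assumes Python's "cp1252" codec is a fair stand-in for MySQL's Latin-1
conversion table, and the function name is made up for illustration. The
exact set of bytes MySQL fails to convert may differ from the set that
Python's codec leaves undefined.

```python
def simulate_lossy_round_trip(text: str) -> str:
    """Imitate UTF-8 text kept in a Latin-1 field going through dump and reload.

    Assumption: Python's cp1252 codec stands in for MySQL's conversion table.
    """
    # What MediaWiki stored: raw UTF-8 bytes in a field labelled Latin-1.
    stored = text.encode("utf-8")

    # Dump: each byte is read as a Windows-1252 character. Bytes with no
    # assignment cannot be converted and are turned into "?".
    dumped = stored.decode("cp1252", errors="replace").replace("\ufffd", "?")

    # Reload: the characters are converted back into single bytes.
    reloaded = dumped.encode("cp1252")

    # MediaWiki then reads those bytes as UTF-8 again.
    return reloaded.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for title in ("Вода", "Россия"):
        print(title, "->", simulate_lossy_round_trip(title))
```

In this sketch "Вода" comes back intact, while "Россия" does not: the UTF-8
encodings of "с" (D1 81) and "я" (D1 8F) each end in a byte that
Windows-1252 leaves unassigned, which would explain why only some letters
are hit.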

Use the --default-character-set=latin1 option on mysqldump when creating
your database dumps to avoid this lossy conversion. (And/or find another
way to dump/copy databases or another equivalent option to avoid the
unnecessary conversion.)
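
If the dump is scripted, the call might look roughly like the sketch
below. The database name, user name and output path are placeholders, and
the helper functions are hypothetical; the character set option is spelled
--default-character-set in the mysqldump documentation.

```python
import subprocess
from typing import List


def build_dump_command(database: str, user: str) -> List[str]:
    """Build a mysqldump command line that skips the Latin-1 to UTF-8 conversion."""
    return [
        "mysqldump",
        "--default-character-set=latin1",  # dump the bytes as they are stored
        "--user=" + user,
        "--password",  # no value given, so mysqldump prompts for it
        database,
    ]


def dump_database(database: str, user: str, output_path: str) -> None:
    """Run mysqldump and write its output to a file without re-encoding it."""
    # Binary mode, so Python does not apply a text encoding of its own.
    with open(output_path, "wb") as out:
        subprocess.run(build_dump_command(database, user), stdout=out, check=True)


if __name__ == "__main__":
    # Placeholder names; substitute the real wiki database and account.
    print(" ".join(build_dump_command("wikidb", "wikiuser")))
```

The resulting file should then be loaded with the same character set
setting on the client, so that the bytes are not converted on the way
back in either.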

Since it appears that your hosting provider did this for you, you may
need to ask them to redo it. Alternatively, you may be able to rig up a
statistical fix based on which characters are being corrupted, though
I'm not sure how easy that would be.

- -- brion vibber (brion @ wikimedia.org)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGO06ewRnhpk1wk44RAsDXAKCJWzzINvB0TKwsSMQ6s0HNGvompQCg1ESu
4G38Ult52BKTj3Ruq40UtJk=
=xVIo
-----END PGP SIGNATURE-----
