- Due to the limitations of MySQL's Unicode support, but default we
continue to treat MySQL fields as binary and store pure UTF-8 Unicode in them, although MySQL may have them listed as Latin-1 depending on your server's defaults.
Surely this is a bug? If MW wants binary fields, then surely it should explicitly create them as binary, instead of leaving it up to some random server default?
Ian
-----Original Message----- From: Brion Vibber [mailto:brion@wikimedia.org] Sent: Friday, May 04, 2007 08:18 AM Pacific Standard Time To: MediaWiki announcements and site admin list Subject: Re: [Mediawiki-l] Sudden problem with some greek andcyrillic letters
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Sylvain Machefert wrote:
Hi Brion, what is strange, is that only the titles are affected, not the content of the pages. Is that normal ?
Yes -- the page text is in a binary BLOB field, which will not undergo the bogus lossy conversion.
The summarize, the problem is roughly:
* MediaWiki assumes that MySQL will preserve data that is put into it * MySQL sometimes corrupts the data
in a little more detail:
* Due to the limitations of MySQL's Unicode support, but default we continue to treat MySQL fields as binary and store pure UTF-8 Unicode in them, although MySQL may have them listed as Latin-1 depending on your server's defaults.
* The mysqldump backup program by default in 4.1 and later applies a conversion of non-binary fields to UTF-8, with a marker to have them appropriately converted back when read in.
* This conversion is lossy -- it treats Latin-1 as the Windows-1252 code page, which is an extension of ISO 8859-1 with additional characters in the 128-159 range which in ISO 8859 and Unicode is supposed to contain non-printing control characters. Four of the code points in this range are not assigned in Windows-1252, and so cannot be converted to UTF-8 Unicode -- these characters are silently corrupted into "?" characters during the conversion if they appear.
* The UTF-8 encoding of Unicode uses the byte values which correspond to those four non-convertible characters.
* As a result, UTF-8 text in a Latin-1 field may be corrupted, as some characters are destroyed in the conversion back and forth.
Use the --default-charset=latin1 option on mysqldump when creating your database dumps to avoid this lossy conversion. (And/or find another way to dump/copy databases or another equivalent option to avoid the unnecessary conversion.)
Since it appears that your hosting provider did this for you, you may need to ask them to redo it. Alternatively, you may be able to rig up a statistical fix based on which characters are being corrupted, though I'm not sure how easy that would be.
- -- brion vibber (brion @ wikimedia.org)
_______________________________________________ MediaWiki-l mailing list MediaWiki-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Ian Smith wrote:
- Due to the limitations of MySQL's Unicode support, but default we
continue to treat MySQL fields as binary and store pure UTF-8 Unicode in them, although MySQL may have them listed as Latin-1 depending on your server's defaults.
Surely this is a bug? If MW wants binary fields, then surely it should explicitly create them as binary, instead of leaving it up to some random server default?
In MySQL 4.0, there were no table or column character sets, there was only a server character set. You could specify a "binary" modifier on columns, altering the collation, which we duly did. Our 4.0-compatible schema thus uses binary collations for varchar columns, but does not specify a character set, since there was no way to do that in MySQL 4.0.
As of MediaWiki 1.9, there is an installer option to select a "MySQL 5 binary" schema, which does specify a binary character set.
-- Tim Starling
Hi Tim, is "Mysql 5 binary" schema working on MySQL 4.1-11 ? is there a way to upgrade the datas to this better schema ?
2007/5/6, Tim Starling tstarling@wikimedia.org:
As of MediaWiki 1.9, there is an installer option to select a "MySQL 5 binary" schema, which does specify a binary character set.
-- Tim Starling
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Sylvain Machefert wrote:
Hi Tim, is "Mysql 5 binary" schema working on MySQL 4.1-11 ?
In theory, however it may have some issues as it is not well tested.
is there a way to upgrade the datas to this better schema ?
We don't have an automated conversion worked out at this time, but you could probably do it with appropriate ALTER TABLEs. Might need some sort of conversion applied somewhere. Let us know if you work it out. :)
- -- brion vibber (brion @ wikimedia.org)
I try the other idea I had. - I create a page_title_blob field in wiki_page table, which is a binary blob. - sometimes, a query update wiki_page set page_title_blob=page_title In case of problem - a query update wiki_page set page_title=page_title_blob
If the binary blob is well dumped/imported, then a simple query could repair page_title field.
In a first time I planned to do it with complex php script, but Database class is a bit obscure for me, so I tried with queries in PhpMyAdmin, and it seems to work.
2007/5/7, Brion Vibber brion@wikimedia.org:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Sylvain Machefert wrote:
Hi Tim, is "Mysql 5 binary" schema working on MySQL 4.1-11 ?
In theory, however it may have some issues as it is not well tested.
is there a way to upgrade the datas to this better schema ?
We don't have an automated conversion worked out at this time, but you could probably do it with appropriate ALTER TABLEs. Might need some sort of conversion applied somewhere. Let us know if you work it out. :)
- -- brion vibber (brion @ wikimedia.org)
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFGPznrwRnhpk1wk44RAo2sAKCpaF9Ws+b8kD93XPMUwgrFazwOSwCgxgWa whyz/6ESdH0GSXjCBWDW17E= =TfQS -----END PGP SIGNATURE-----
MediaWiki-l mailing list MediaWiki-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/mediawiki-l
mediawiki-l@lists.wikimedia.org