[Mediawiki-l] Sudden problem with some greek and cyrillic letters

Fri May 4 16:12:15 UTC 2007

Thanks for your explanation, Brion.
I didn't get an answer from my hosting provider. I asked them whether they
had encountered a MySQL crash or something else and then dumped and
reimported all the databases.
I don't see any solution other than the one I'm already applying by hand...
140 rows are affected.

On the Greek, Russian... Wikipedias, do you use mysqldump
--default-character-set=latin1, and have you successfully reimported the
dumps?
I'll ask my provider whether the --default-character-set=latin1 parameter
can be added to my automatic weekly dumps, but I'm afraid it may not be
possible.
I didn't see such an option in phpMyAdmin or SQLyog (a MySQL client).

2007/5/4, Brion Vibber <brion at wikimedia.org>:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Sylvain Machefert wrote:
> > Hi Brion,
> > what is strange is that only the titles are affected, not the content
> > of the pages. Is that normal?
>
> Yes -- the page text is in a binary BLOB field, which will not undergo
> the bogus lossy conversion.
>
> To summarize, the problem is roughly:
>
> * MediaWiki assumes that MySQL will preserve data that is put into it
> * MySQL sometimes corrupts the data
>
> In a little more detail:
>
> * Due to the limitations of MySQL's Unicode support, by default we
> continue to treat MySQL fields as binary and store pure UTF-8 Unicode in
> them, although MySQL may have them listed as Latin-1 depending on your
> server's defaults.
>
> * The mysqldump backup program by default in 4.1 and later applies a
> conversion of non-binary fields to UTF-8, with a marker to have them
> appropriately converted back when read in.
>
> * This conversion is lossy -- it treats Latin-1 as the Windows-1252 code
> page, which is an extension of ISO 8859-1 with additional characters in
> the 128-159 range, which in ISO 8859 and Unicode is supposed to contain
> non-printing control characters. Five of the code points in this range
> (0x81, 0x8D, 0x8F, 0x90 and 0x9D) are not assigned in Windows-1252, and
> so cannot be converted to UTF-8 Unicode -- these characters are silently
> corrupted into "?" characters during the conversion if they appear.
>
> * The UTF-8 encoding of Unicode uses the byte values which correspond to
> those five non-convertible characters.
>
> * As a result, UTF-8 text in a Latin-1 field may be corrupted, as some
> characters are destroyed in the conversion back and forth.
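
To make the steps above concrete, here is a minimal Python sketch of the
round trip. It is only an imitation: it uses Python's cp1252 codec as a
stand-in for MySQL's latin1 conversion table, which is an assumption and may
not match every MySQL version exactly. The function names are hypothetical.
In Python's codec the undefined byte values are 0x81, 0x8D, 0x8F, 0x90 and
0x9D.

```python
def is_defined_in_cp1252(byte_value: int) -> bool:
    """True if Python's cp1252 codec assigns a character to this byte."""
    try:
        bytes([byte_value]).decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


# Byte values in the 128-159 range that the codec leaves unassigned.
UNDEFINED_CP1252 = [b for b in range(0x80, 0xA0) if not is_defined_in_cp1252(b)]


def lossy_round_trip(stored: bytes) -> bytes:
    """Imitate dumping a 'Latin-1' field as text and loading it back."""
    # Dump side: every stored byte is read as a cp1252 character.
    # Undefined bytes cannot be converted and become U+FFFD here.
    as_text = stored.decode("cp1252", errors="replace")
    # Reload side: the text is converted back to cp1252 bytes.
    # U+FFFD has no cp1252 byte, so it comes back as "?".
    return as_text.encode("cp1252", errors="replace")


if __name__ == "__main__":
    for title in ("Наташа", "та"):
        before = title.encode("utf-8")
        after = lossy_round_trip(before)
        status = "intact" if before == after else "corrupted"
        print(title, before.hex(" "), "->", after.hex(" "), status)
```

Only characters whose UTF-8 encoding contains one of the undefined bytes are
damaged, which would explain why just some Greek and Cyrillic letters are
affected.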
>
> Use the --default-character-set=latin1 option on mysqldump when creating
> your database dumps to avoid this lossy conversion. (And/or find another
> way to dump/copy databases, or an equivalent option, to avoid the
> unnecessary conversion.)
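
For an automated dump, the advice above could be scripted roughly as follows.
This is a sketch, not a tested backup script: the option is spelled
--default-character-set in the mysqldump manual, the helper names are
hypothetical, the database, user and file names in the usage lines are
placeholders, and the password is assumed to come from a MySQL option file
rather than the command line.

```python
import subprocess


def build_dump_command(database: str, user: str, outfile: str) -> list[str]:
    """Build a mysqldump command line that skips the charset conversion."""
    return [
        "mysqldump",
        # Dump the fields as latin1 so the stored bytes pass through unchanged.
        "--default-character-set=latin1",
        f"--user={user}",
        # Write the dump to a file directly instead of through the shell.
        f"--result-file={outfile}",
        database,
    ]


def run_dump(database: str, user: str, outfile: str) -> None:
    """Run the dump; raises CalledProcessError if mysqldump fails."""
    subprocess.run(build_dump_command(database, user, outfile), check=True)


if __name__ == "__main__":
    # Placeholder names; this only prints the command it would run.
    print(" ".join(build_dump_command("wikidb", "wikiuser", "wikidb.sql")))
```

The dump would need to be read back with the same character set for the
bytes to survive the reload as well.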
>
> Since it appears that your hosting provider did this for you, you may
> need to ask them to redo it. Alternatively, you may be able to rig up a
> statistical fix based on which characters are being corrupted, though
> I'm not sure how easy that would be.
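
One possible shape for such a fix, as a rough Python sketch. It makes several
assumptions: that the lost bytes are the five left undefined by Python's
cp1252 codec, that only two-byte UTF-8 sequences (which cover Greek and
Cyrillic) are affected, and that a corrupted character shows up as a lead
byte followed by "?". The function names and the `preferred` ranking are
hypothetical, and the result is a guess that still needs checking by hand.

```python
# Continuation bytes assumed to have been turned into "?" by the conversion.
LOST_BYTES = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def candidates(lead: int) -> list[str]:
    """Characters whose two-byte UTF-8 form is `lead` plus a lost byte."""
    return [bytes([lead, b]).decode("utf-8") for b in LOST_BYTES]


def repair(corrupted: bytes, preferred: str) -> str:
    """Replace each 'lead byte + ?' pair with the most preferred candidate.

    `preferred` lists characters from most to least likely, for example
    taken from letter frequencies of the wiki's language.
    """
    out = bytearray()
    i = 0
    while i < len(corrupted):
        byte = corrupted[i]
        next_is_question = i + 1 < len(corrupted) and corrupted[i + 1] == 0x3F
        # 0xC2-0xDF start a two-byte sequence, so a "?" right after one of
        # them cannot be a real question mark in valid UTF-8.
        if 0xC2 <= byte <= 0xDF and next_is_question:
            options = candidates(byte)
            pick = next((c for c in preferred if c in options), options[0])
            out += pick.encode("utf-8")
            i += 2
        else:
            out.append(byte)
            i += 1
    return out.decode("utf-8", errors="replace")


if __name__ == "__main__":
    damaged = b"\xd0?" + "аташа".encode("utf-8")
    print(candidates(0xD0))
    print(repair(damaged, preferred="НА"))
```

With only five candidates per damaged position, a frequency ranking will
often pick the right letter, but a title starting with a different candidate
would be repaired wrongly, so with 140 rows a manual review of the output is
still advisable.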
>
> - -- brion vibber (brion @ wikimedia.org)
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (Darwin)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFGO06ewRnhpk1wk44RAsDXAKCJWzzINvB0TKwsSMQ6s0HNGvompQCg1ESu
> 4G38Ult52BKTj3Ruq40UtJk=
> =xVIo
> -----END PGP SIGNATURE-----

-- 
Sylvain Machefert
http://iubito.free.fr
http://tousauxbalkans.jexiste.fr