[Mediawiki-l] Sudden problem with some greek and cyrillic letters

Sylvain Machefert iubito at gmail.com
Fri May 4 16:56:33 UTC 2007


I've an idea but I don't know how I can do this.

1) I could create a new column in wiki_page, called "page_title_blob", which
holds the same data as "page_title" but in a BLOB field. This column would
not be affected by the dump/import problem.
2) I regularly run a PHP script which copies page_title into
page_title_blob, OR I hack the MW code to fill page_title_blob on each
insert/update of an article title (create, move).
3) When the problem appears after a dump/import or a migration to another
MySQL server, I run a script which copies page_title_blob back into
page_title.
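
The three steps above could be sketched roughly as follows. This is only an
illustration under assumptions: it uses Python with an in-memory SQLite
database as a stand-in so that it runs anywhere, not PHP against MySQL, and
the helper names (add_backup_column, backup_titles, restore_titles) are made
up for the example. Only the table and column names come from the message.
On a real MySQL server the same UPDATE statements would need to be checked
against the actual column types and character sets.

```python
import sqlite3

def add_backup_column(db: sqlite3.Connection) -> None:
    # Step 1: add a binary column next to page_title.
    db.execute("ALTER TABLE wiki_page ADD COLUMN page_title_blob BLOB")

def backup_titles(db: sqlite3.Connection) -> None:
    # Step 2: copy the raw title bytes into the binary column.
    db.execute("UPDATE wiki_page SET page_title_blob = page_title")

def restore_titles(db: sqlite3.Connection) -> None:
    # Step 3: copy the saved bytes back over the damaged titles.
    db.execute(
        "UPDATE wiki_page SET page_title = page_title_blob "
        "WHERE page_title_blob IS NOT NULL"
    )

# Stand-in table: titles are stored as raw UTF-8 bytes, as MediaWiki does.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE wiki_page (page_id INTEGER PRIMARY KEY, page_title BLOB)")
original = "Россия".encode("utf-8")
db.execute("INSERT INTO wiki_page (page_id, page_title) VALUES (?, ?)", (1, original))

add_backup_column(db)
backup_titles(db)

# Simulate the damage: some bytes of the title were turned into "?".
damaged = b"\xd0\xa0\xd0\xbe\xd1?\xd1?\xd0\xb8\xd1?"
db.execute("UPDATE wiki_page SET page_title = ? WHERE page_id = 1", (damaged,))

restore_titles(db)
restored = db.execute(
    "SELECT page_title FROM wiki_page WHERE page_id = 1"
).fetchone()[0]
```

The backup is only as fresh as the last run of step 2, so titles created or
moved after that run would not be covered by the restore.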

Can you help me write the 2 scripts? I don't know how to read/write a blob
and page_title given the encoding problem.

Thanks in advance

2007/5/4, Sylvain Machefert <iubito at gmail.com>:
>
> Thanks for your explanation Brion.
> I didn't get an answer from my hosting provider. I asked them whether they
> had a MySQL crash or something else and then did a dump/import of all the
> databases.
> I don't see another solution than the one I'm doing by hand... 140 rows
> are affected.
>
> On the Greek, Russian... Wikipedias, do you use mysqldump
> --default-character-set=latin1, and did you successfully reimport the dumps?
> I'll ask my provider if it's possible to add the
> --default-character-set=latin1 parameter to my automatic weekly dumps, but
> I'm afraid it may not be.
> In phpMyAdmin and SQLyog (MySQL client), I didn't see such an option.
>
> 2007/5/4, Brion Vibber <brion at wikimedia.org>:
> >
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Sylvain Machefert wrote:
> > > Hi Brion,
> > > what is strange, is that only the titles are affected, not the content
> > > of the pages. Is that normal ?
> >
> > Yes -- the page text is in a binary BLOB field, which will not undergo
> > the bogus lossy conversion.
> >
> > To summarize, the problem is roughly:
> >
> > * MediaWiki assumes that MySQL will preserve data that is put into it
> > * MySQL sometimes corrupts the data
> >
> > In a little more detail:
> >
> > * Due to the limitations of MySQL's Unicode support, by default we
> > continue to treat MySQL fields as binary and store pure UTF-8 Unicode in
> > them, although MySQL may have them listed as Latin-1 depending on your
> > server's defaults.
> >
> > * The mysqldump backup program by default in 4.1 and later applies a
> > conversion of non-binary fields to UTF-8, with a marker to have them
> > appropriately converted back when read in.
> >
> > * This conversion is lossy -- it treats Latin-1 as the Windows-1252 code
> > page, which is an extension of ISO 8859-1 with additional characters in
> > the 128-159 range, which in ISO 8859-1 and Unicode is reserved for
> > non-printing control characters. Five of the code points in this range
> > (0x81, 0x8D, 0x8F, 0x90, 0x9D) are not assigned in Windows-1252, and so
> > cannot be converted to UTF-8 Unicode -- these bytes are silently
> > corrupted into "?" characters during the conversion if they appear.
> >
> > * The UTF-8 encoding of Unicode uses the byte values which correspond to
> > those non-convertible code points as continuation bytes in multi-byte
> > sequences. For example, Cyrillic "с" (U+0441) is encoded as the bytes
> > D1 81.
> >
> > * As a result, UTF-8 text in a Latin-1 field may be corrupted, as some
> > characters are destroyed in the conversion back and forth.
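
To make the mechanism described above concrete, here is a small simulation
of the round trip. It is only a sketch under an assumption: Python's cp1252
codec stands in for MySQL's Latin-1 conversion, and the function names are
made up for the example. It lists the bytes the codec leaves unassigned and
shows what happens to a UTF-8 title whose bytes pass through the conversion
and back.

```python
def unassigned_cp1252_bytes() -> list[int]:
    # Collect the byte values in 0x80-0x9F that cp1252 does not define.
    lost = []
    for value in range(0x80, 0xA0):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            lost.append(value)
    return lost

def simulate_round_trip(stored: bytes) -> bytes:
    # "Dump": read the stored bytes as cp1252; undefined bytes cannot be
    # converted and are replaced by "?".
    as_text = stored.decode("cp1252", errors="replace").replace("\ufffd", "?")
    # "Import": convert the text back into cp1252 bytes.
    return as_text.encode("cp1252")

title = "Россия".encode("utf-8")
damaged = simulate_round_trip(title)
```

In this simulation, titles made only of ASCII characters, or of letters whose
UTF-8 bytes avoid the unassigned values, come back unchanged, which would
explain why only some Greek and Cyrillic letters are affected.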
> >
> > Use the --default-character-set=latin1 option on mysqldump when creating
> > your database dumps to avoid this lossy conversion. (And/or find another way
> > to dump/copy databases or another equivalent option to avoid the
> > unnecessary conversion.)
> >
> > Since it appears that your hosting provider did this for you, you may
> > need to ask them to redo it. Alternatively, you may be able to rig up a
> > statistical fix based on which characters are being corrupted, though
> > I'm not sure how easy that would be.
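
One possible starting point for such a fix is to enumerate, for each damaged
title, every string it could have been before the corruption. The sketch
below is an assumption-laden illustration, not a tested repair tool: the
function name and the known_titles set are hypothetical, and the list of
lost bytes is the set that Python's cp1252 codec leaves unassigned. Each "?"
byte is tried both as a literal question mark and as each lost byte, and
only combinations that form valid UTF-8 are kept.

```python
from itertools import product

# Byte values assumed to have been turned into "?" by the conversion.
LOST_BYTES = (0x81, 0x8D, 0x8F, 0x90, 0x9D)

def repair_candidates(corrupted: bytes) -> list[str]:
    # Positions of every "?" byte; each may be literal or a lost byte.
    positions = [i for i, b in enumerate(corrupted) if b == 0x3F]
    options = (0x3F,) + LOST_BYTES
    results = []
    for combo in product(options, repeat=len(positions)):
        buf = bytearray(corrupted)
        for pos, value in zip(positions, combo):
            buf[pos] = value
        try:
            results.append(buf.decode("utf-8"))
        except UnicodeDecodeError:
            continue  # not valid UTF-8, so not a possible original
    return results

damaged = b"\xd0\xa0\xd0\xbe\xd1?\xd1?\xd0\xb8\xd1?"
candidates = repair_candidates(damaged)

# Hypothetical list of titles known to be good, e.g. typed in by hand.
known_titles = {"Россия"}
matches = [c for c in candidates if c in known_titles]
```

The number of candidates grows quickly with the number of "?" bytes, so
choosing among them still needs outside knowledge such as a word list or
letter frequencies for the language.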
> >
> > - -- brion vibber (brion @ wikimedia.org)
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.2.2 (Darwin)
> > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >
> > iD8DBQFGO06ewRnhpk1wk44RAsDXAKCJWzzINvB0TKwsSMQ6s0HNGvompQCg1ESu
> > 4G38Ult52BKTj3Ruq40UtJk=
> > =xVIo
> > -----END PGP SIGNATURE-----
> >
>
>
>
> --
> Sylvain Machefert
> http://iubito.free.fr
> http://tousauxbalkans.jexiste.fr
>



-- 
Sylvain Machefert
http://iubito.free.fr
http://tousauxbalkans.jexiste.fr

