Hello,
My wiki uses the Greek and Cyrillic alphabets (in song lyrics). For the past two weeks I haven't modified anything in my MW installation or articles, and everything was fine.
Today, a lot of article titles with Latin diacritics, Greek or Cyrillic letters have problems. See for example the list of articles on the page http://tousauxbalkans.jexiste.fr/Bulgarie
My host often upgrades the Linux kernel, PHP or MySQL. I can only see one of these upgrades as the cause of my problem. Here are my current versions (http://tousauxbalkans.jexiste.fr/Special:Version):
* MediaWiki: 1.9.3
* PHP: 5.2.0-8+etch3 (cgi-fcgi)
* MySQL: 4.1.11-Debian_4sarge7-log
I'll ask my hosting provider what he changed recently. Has anyone else had this kind of problem? Thanks in advance,
Sylvain Machefert wrote:
Today, a lot of article titles with Latin diacritics, Greek or Cyrillic letters have problems. [...]
You or your host may have corrupted your data by dumping it with mysqldump without the proper options, causing data loss through the lossy charset conversion (latin1 -> utf8 -> latin1), which damages the UTF-8 data stored in the fields.
-- brion vibber (brion @ wikimedia.org)
Hi Brion, what is strange is that only the titles are affected, not the content of the pages. Is that normal?
I made a little script which lists the erroneous titles, i.e.:

  ID   | utf8_decode | what is in the DB
  1036 | Στης πίκ?ώя �Ď юގՏ?όνησα | Στης πίκÏ?ας τα ξεÏ?όνησα
  1039 | Συννεφιασμένη_Κυ?ώَюڎ | ΣυννεφιασμÎνη_ΚυÏ?ιακÎ(r)
  3597 | Από_β?ώюԏ?ς_ξεκίνησα | Î'πό_βÏ?αδÏ?Ï‚_ξεκίνησα
  ...

It happens on the Greek "rho", which is UTF-8-encoded as Ï followed by a "?" in a lozenge (�). Now it is stored as Ï and "?", which turns back into "?" and gets corrupted. In 3597 there are two "rho"s that get corrupted twice, so the beginning (before the first "?") and the end (after the second "?") are OK, and the part between the two "?" is bad. The utf8_decode of what is in the DB is bad between the "?" --> ώюԏ. Searching for "ξεκίνησα" on my wiki gives the result Από β�?αδ�?ς ξεκίνησα, but the article is not visible.
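(For reference, a rough shell version of such a listing query -- the table and database names and the credentials below are placeholders for whatever your installation uses, and a literal "?" can of course also appear in legitimate titles, so the list still needs a manual check:)

  # List page titles containing a literal '?' byte, the tell-tale sign of the
  # lossy conversion Brion describes. '?' is not a LIKE wildcard, so it is
  # matched literally here.
  mysql --default-character-set=latin1 -u wikiuser -p wikidb \
    -e "SELECT page_id, page_title FROM wiki_page WHERE page_title LIKE '%?%';"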
The only thing I found is to update the database and set the title to TMP, then in the wiki rename the page TMP to the right name and delete TMP. Do you see an easier way to solve it?
Sylvain Machefert wrote:
Hi Brion, what is strange is that only the titles are affected, not the content of the pages. Is that normal?
Yes -- the page text is in a binary BLOB field, which will not undergo the bogus lossy conversion.
To summarize, the problem is roughly:
* MediaWiki assumes that MySQL will preserve data that is put into it
* MySQL sometimes corrupts the data
in a little more detail:
* Due to the limitations of MySQL's Unicode support, by default we continue to treat MySQL fields as binary and store pure UTF-8 Unicode in them, although MySQL may have them listed as Latin-1 depending on your server's defaults.
* In MySQL 4.1 and later, the mysqldump backup program by default applies a conversion of non-binary fields to UTF-8, with a marker to have them converted back appropriately when read in.
* This conversion is lossy -- it treats Latin-1 as the Windows-1252 code page, which is an extension of ISO 8859-1 with additional characters in the 128-159 range which in ISO 8859 and Unicode is supposed to contain non-printing control characters. Four of the code points in this range are not assigned in Windows-1252, and so cannot be converted to UTF-8 Unicode -- these characters are silently corrupted into "?" characters during the conversion if they appear.
* The UTF-8 encoding of Unicode uses the byte values which correspond to those four non-convertible characters.
* As a result, UTF-8 text in a Latin-1 field may be corrupted, as some characters are destroyed in the conversion back and forth.
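(As a concrete illustration of why Sylvain's Greek "rho" is the victim -- a small sketch, assuming a UTF-8 locale, run from any shell:)

  # Greek small letter rho (ρ) is encoded as the two bytes CF 81 in UTF-8:
  printf 'ρ' | xxd -p     # prints: cf81
  # 0x81 is one of the Windows-1252 code points with no assigned character,
  # so the lossy latin1 -> utf8 pass described above has nothing to map it to
  # and writes '?' (0x3F) instead; the original byte cannot be recovered.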
Use the --default-character-set=latin1 option on mysqldump when creating your database dumps to avoid this lossy conversion. (And/or find another way to dump/copy databases, or another equivalent option, to avoid the unnecessary conversion.)
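(For example -- the database name and credentials here are placeholders:)

  # Dump with a latin1 connection so the latin1-labelled columns are written
  # out byte-for-byte and the UTF-8 data inside them survives the round trip.
  mysqldump --default-character-set=latin1 -u wikiuser -p wikidb > wikidb.sql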
Since it appears that your hosting provider did this for you, you may need to ask them to redo it. Alternatively, you may be able to rig up a statistical fix based on which characters are being corrupted, though I'm not sure how easy that would be.
-- brion vibber (brion @ wikimedia.org)
Thanks for your explanation, Brion. I haven't got an answer from my hosting provider yet; I asked them whether they encountered a MySQL crash or something else and then dumped/re-imported all the databases. I don't see another solution than the one I'm doing by hand... 140 rows are affected.
On the Greek, Russian, etc. Wikipedias, do you use mysqldump --default-character-set=latin1, and have you successfully re-imported the dumps? I'll ask my provider if it's possible to add the --default-character-set=latin1 parameter to my automatic weekly dumps, but I'm afraid it may not be. In phpMyAdmin and SQLyog (a MySQL client), I didn't see such an option.
I've an idea but I don't know how I can do this.
1) I could create a new column in wiki_page, called "page_title_blob", which is a BLOB containing the same data as "page_title". This column would not have problems with dump/import.
2) I regularly run a PHP script which copies page_title into page_title_blob, OR I hack the MW code to write to page_title_blob on each insert/update of an article title (create, move).
3) If the problem appears after a dump/import or a migration to another MySQL server, I run a script which copies page_title_blob back into page_title.
Can you help me write the two scripts? I don't know how to read/write the BLOB and page_title with the encoding problem.
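(A rough, untested sketch of steps 1-3 done directly from the shell -- the wiki_ prefix comes from the message above, everything else is a placeholder, and it is worth trying on a copy of the database first. Other tables, such as the link tables, also carry title text, so a correctly made dump remains the safer fix:)

  # Step 1 (once): add a binary column alongside page_title.
  mysql -u wikiuser -p wikidb \
    -e "ALTER TABLE wiki_page ADD COLUMN page_title_blob BLOB;"

  # Step 2 (run from cron, or after each edit): copy the raw title bytes into
  # the BLOB column, which mysqldump does not charset-convert.
  mysql -u wikiuser -p wikidb \
    -e "UPDATE wiki_page SET page_title_blob = page_title;"

  # Step 3 (only after a bad dump/import): copy the preserved bytes back.
  mysql -u wikiuser -p wikidb \
    -e "UPDATE wiki_page SET page_title = page_title_blob;"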
Thanks in advance
Brion Vibber wrote:
Use the --default-character-set=latin1 option on mysqldump when creating your database dumps to avoid this lossy conversion. [...]
Thank you Brion for these explanations. I now understand why I had problems using mysqldump.
It seems that few people are aware of this dangerous behaviour of MySQL and mysqldump. Is there any documentation on the best way to back up a MediaWiki installation and its database, apart from the message above?
Francois Colonna
On 06/05/07, Frames Project frames@lct.jussieu.fr wrote:
Is there any documentation on the best way to back up a MediaWiki installation and its database, apart from the message above?
There's http://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki, which discusses what files to back up as well as how to handle the database, and it has a section describing the character set conversion problem and how to avoid it.
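(Roughly the kind of thing that page describes -- the paths, database name and credentials here are placeholders, and the latin1 option matches the advice earlier in this thread:)

  # Database: dump with a latin1 connection so UTF-8 text stored in
  # latin1-labelled columns is copied byte-for-byte.
  mysqldump --default-character-set=latin1 -u wikiuser -p wikidb > wikidb.sql

  # Files: the uploads and the configuration are the parts that cannot be
  # rebuilt from the database alone.
  tar czf wiki-backup.tar.gz /path/to/wiki/images /path/to/wiki/LocalSettings.php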
Rob Church
Hi,
I've got a rather complex requirement. I would like to standardize the content on all my wiki pages using a template; however, this template is rather complex.
I know you can pass multiple values to a template, but is there any way I can create a form that manages this?
I.e., all a user needs to do is fill in the various text boxes and the data gets passed to the template.
Thoughts?
AJ