Say, e.g., api.php?action=query&list=logevents looks fine, but when I look at the same table in an SQL dump, the Chinese utf8 is just a latin1 jumble. How can I convert such strings back to utf8? I can't find the place where MediaWiki converts them back and forth.
You see I'm just curious, let's say all I had was the SQL dump, and GNU/Linux tools, but no MediaWiki. How can I get the original UTF-8 strings back out of the SQL dump?
jidanni@jidanni.org schrieb:
Say, e.g., api.php?action=query&list=logevents looks fine, but when I look at the same table in an SQL dump, the Chinese utf8 is just a latin1 jumble. How can I convert such strings back to utf8? I can't find the place where MediaWiki converts them back and forth.
It doesn't. It's already UTF-8, only MySQL thinks it's not. This is because MySQL doesn't support UTF-8 before 5.0, and even in 5.0 and later, the support is flaky.
So, MediaWiki (by default) tells MySQL that the data is latin1 and treats it as binary.
Whether you see it as a "jumble" depends entirely on the program you view it with.
This is a nasty hack, and it may cause corruption when importing/exporting dumps. Be careful about it.
-- daniel
The BLOBs are fine, it's just the VARCHARs, `ar_title` varchar(255) character set latin1 collate latin1_bin NOT NULL default '', How can one convert these back to UTF-8 with a script, outside of mysql, just for occasional viewing of the SQL dumps outside of the wiki. Yes, my wiki works fine. OK, I'll study http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t...
jidanni@jidanni.org schrieb:
The BLOBs are fine, it's just the VARCHARs, `ar_title` varchar(255) character set latin1 collate latin1_bin NOT NULL default '', How can one convert these back to UTF-8 with a script, outside of mysql, just for occasional viewing of the SQL dumps outside of the wiki. Yes, my wiki works fine. OK, I'll study http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t...
Again: never mind what it is declared as, it *is* UTF-8. MySQL may, however, automatically convert it on the way to the client or dump program. To prevent that, tell MySQL that the encoding of your client is latin1. Confusing? Hell yes :)
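To make the round trip concrete: the "jumble" is just UTF-8 bytes that got re-encoded as if they were latin1 characters, and that misinterpretation is reversible outside of MySQL. Here is a minimal Python sketch of the idea (an illustration, not MediaWiki code; note that MySQL's "latin1" is really cp1252, so a real dump may need "cp1252" instead of "latin-1" for bytes in the 0x80-0x9F range):

```python
# The column is declared latin1 but actually holds UTF-8 bytes.
# When mysqldump converts "latin1" to utf8 for the client, each original
# byte becomes one Unicode code point -- classic double encoding.
# Reversing the misinterpretation recovers the original text.

def fix_mojibake(garbled: str) -> str:
    # Turn each code point back into its original byte value,
    # then decode those bytes as the UTF-8 they always were.
    return garbled.encode("latin-1").decode("utf-8")

original = "中文"                                      # the real UTF-8 content
garbled = original.encode("utf-8").decode("latin-1")   # what the dump shows
print(fix_mojibake(garbled))                           # 中文
```

The same trick works in reverse to reproduce the jumble, which is handy for checking which misinterpretation a given dump actually suffered.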
-- daniel
On Fri, Mar 6, 2009 at 3:54 PM, Daniel Kinzler daniel@brightbyte.de wrote:
Again: never mind what it is declared as, it *is* UTF-8. MySQL may, however, automatically convert it on the way to the client or dump program. To prevent that, tell MySQL that the encoding of your client is latin1. Confusing? Hell yes :)
Best way is to use VARBINARY or BINARY
OK, I found that if I use "mysqldump --default-character-set=latin1" I can read everything that is readable in the dump. The only difference from a plain mysqldump is: -/*!40101 SET NAMES utf8 */; +/*!40101 SET NAMES latin1 */; But that doesn't seem to affect restores from the SQL file. I'm sold.
Note to self: look into the code that orders text (collation) in MediaWiki; has to be a fun one :-)
Tei schrieb:
Note to self: look into the code that orders text (collation) in MediaWiki; has to be a fun one :-)
There is none. Sorting is done by the database. That is to say, in the default "compatibility" mode, a binary "collation" is used - that is, byte-by-byte comparison of UTF-8 encoded data. Which sucks. But we are stuck with it until MySQL gets proper Unicode support.
If you set up the database to use proper UTF-8, collation is a bit better (though still not configurable, I think). But it crashes hard if you try to store characters that are outside the Basic Multilingual Plane (Gothic runes, some obscure Chinese characters, ...) - that's why this is not used on Wikipedia.
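The BMP limitation comes from MySQL's "utf8" charset storing at most 3 bytes per character, which only covers code points up to U+FFFF. A quick sanity check for whether text would survive such a column might look like this (a sketch, not part of MediaWiki):

```python
# MySQL's "utf8" charset stores at most 3 bytes per character, so any
# code point above U+FFFF (outside the Basic Multilingual Plane) cannot
# be stored -- the "crashes hard" case described above.

def fits_in_bmp(text: str) -> bool:
    return all(ord(ch) <= 0xFFFF for ch in text)

print(fits_in_bmp("中文"))        # True  -- ordinary CJK is inside the BMP
print(fits_in_bmp("\U00010330"))  # False -- Gothic letter ahsa, U+10330
```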
-- daniel
On Wed, Mar 11, 2009 at 6:14 AM, Daniel Kinzler daniel@brightbyte.de wrote:
There is none. Sorting is done by the database. That is to say, in the default "compatibility" mode, a binary "collation" is used - that is, byte-by-byte comparison of UTF-8 encoded data. Which sucks. But we are stuck with it until MySQL gets proper Unicode support.
And until we upgrade to that version. MySQL 4 doesn't have *any* Unicode support -- or any character encoding support, in fact. Everything is binary.
But we don't have to wait on MySQL. We would just have to store a Unicode sortkey in cl_sortkey instead of the actual Unicode characters. This would require an implementation of a Unicode sorting algorithm in MediaWiki. It could be language-specific or whatever you want.
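The sortkey idea can be illustrated with a toy example. A real implementation would use the full Unicode Collation Algorithm, but even a crude key that folds case and strips accents already beats byte-by-byte comparison (all names below are made up for illustration):

```python
import unicodedata

# Toy illustration: instead of sorting raw titles byte-by-byte, store a
# precomputed sortkey (as in cl_sortkey). This simplified key folds case
# and strips combining accents; a real one would be a full UCA collator.

def simple_sortkey(title: str) -> str:
    decomposed = unicodedata.normalize("NFD", title)
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return stripped.casefold()

titles = ["Zebra", "Éclair", "apple", "Ångström"]
print(sorted(titles))                      # code-point order: uppercase first,
                                           # accented titles pushed to the end
print(sorted(titles, key=simple_sortkey))  # alphabetical, accent-insensitive
```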
2009/3/11 Aryeh Gregor Simetrical+wikilist@gmail.com:
But we don't have to wait on MySQL. We would just have to store a Unicode sortkey in cl_sortkey instead of the actual Unicode characters. This would require an implementation of a Unicode sorting algorithm in MediaWiki. It could be language-specific or whatever you want.
I still hold the belief that implementing a Unicode sorting algorithm is not the business of a PHP wiki engine (just like implementing its own file system). But still, if that is the only way my favorite https://bugzilla.wikimedia.org/show_bug.cgi?id=164 would get resolved…
-- [[cs:User:Mormegil | Petr Kadlec]]
Aryeh Gregor schrieb:
On Wed, Mar 11, 2009 at 6:14 AM, Daniel Kinzler daniel@brightbyte.de wrote:
There is none. Sorting is done by the database. That is to say, in the default "compatibility" mode, a binary "collation" is used - that is, byte-by-byte comparison of UTF-8 encoded data. Which sucks. But we are stuck with it until MySQL gets proper Unicode support.
And until we upgrade to that version. MySQL 4 doesn't have *any* Unicode support -- or any character encoding support, in fact. Everything is binary.
right :)
But we don't have to wait on MySQL. We would just have to store a Unicode sortkey in cl_sortkey instead of the actual Unicode characters. This would require an implementation of a Unicode sorting algorithm in MediaWiki. It could be language-specific or whatever you want.
Yes, I thought about that a bit too. One problem would be that you can't use that to make pretty sections on the category page. But that would be solvable using an extra column, I suppose. Or by some kind of extra special magic mapping.
-- daniel
On Wed, Mar 11, 2009 at 10:04 AM, Daniel Kinzler daniel@brightbyte.de wrote:
Yes, I thought about that a bit too. One problem would be that you can't use that to make pretty sections on the category page. But that would be solvable using an extra column, I suppose. Or by some kind of extra special magic mapping.
While we're implementing obscene hacks, we can reserve the first byte for the namespace (to sort subcats/pages/files separately), use the middle for a Unicode sort key, and reserve the last four bytes for a UTF-8 header character (which could also be language-specific). :D
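Tongue-in-cheek as it is, the layout is easy to sketch: one leading byte for the namespace group, the collation key in the middle, and a fixed-width slot at the end for the UTF-8 bytes of the section-header character. A hypothetical packing (all names invented, not a proposed MediaWiki schema):

```python
# Sketch of the packed-sortkey layout joked about above: one leading byte
# for the namespace group (so subcats / pages / files sort separately),
# the collation key in the middle, and the UTF-8 bytes of a "header"
# character at the end for building category sections.

NS_SUBCAT, NS_PAGE, NS_FILE = 0, 1, 2

def pack_sortkey(ns_group: int, collation_key: bytes, header_char: str) -> bytes:
    header = header_char.encode("utf-8")
    # UTF-8 needs at most 4 bytes per code point; pad to a fixed width
    # so the header can be sliced back off unambiguously.
    return bytes([ns_group]) + collation_key + header.rjust(4, b"\x00")

key = pack_sortkey(NS_PAGE, b"angstrom", "Å")
print(key[0])         # 1 -- the namespace group, so byte comparison
                      #      sorts pages after subcategories
print(key[1:-4])      # b'angstrom' -- the collation key
print(key[-4:].lstrip(b"\x00").decode("utf-8"))  # Å -- the header character
```

Because the namespace byte comes first, a plain binary sort on the packed key already groups subcategories, pages, and files, which is the point of the hack.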
Hoi, If you are interested in collation, you may want to look into the CLDR; it is where the collations are registered per language. There is no such thing as a universally correct sorting algorithm. NB: the CLDR is a Unicode project. Thanks, GerardM
2009/3/11 Aryeh Gregor <Simetrical+wikilist@gmail.com>:
On Wed, Mar 11, 2009 at 6:14 AM, Daniel Kinzler daniel@brightbyte.de wrote:
There is none. Sorting is done by the database. That is to say, in the default "compatibility" mode, a binary "collation" is used - that is, byte-by-byte comparison of UTF-8 encoded data. Which sucks. But we are stuck with it until MySQL gets proper Unicode support.
And until we upgrade to that version. MySQL 4 doesn't have *any* Unicode support -- or any character encoding support, in fact. Everything is binary.
But we don't have to wait on MySQL. We would just have to store a Unicode sortkey in cl_sortkey instead of the actual Unicode characters. This would require an implementation of a Unicode sorting algorithm in MediaWiki. It could be language-specific or whatever you want.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hi, I've tried upgrading my 1.11 to 1.14 and get this "illegal mix of collations" error. I went through the normal upgrade procedure first, but this failed, so I then tried exporting as XML and importing into a completely fresh 1.14 install, and still I get the error!
I've found that by setting $wgDBmysql5 to false things seem to work ok, but is this really a good solution? I'd like everything to be running up to date, not in some backward compatibility mode... does anyone have any idea how to fix the problem properly?
Aran wrote:
I've found that by setting $wgDBmysql5 to false things seem to work ok, but is this really a good solution? I'd like everything to be running up to date, not in some backward compatibility mode... does anyone have any idea how to fix the problem properly?
It's not a backwards compatibility mode. It's just a different configuration. In fact, the Wikimedia servers are running MySQL 4.0! :)
Platonides schrieb:
Aran wrote:
I've found that by setting $wgDBmysql5 to false things seem to work ok, but is this really a good solution? I'd like everything to be running up to date, not in some backward compatibility mode... does anyone have any idea how to fix the problem properly?
It's not a backwards compatibility mode. It's just a different configuration. In fact, the Wikimedia servers are running MySQL 4.0! :)
Well, it *is* called compatibility mode. I suspect that's because it works with old versions of MySQL. In any case, the "experimental" modes should be fixed if broken. But they *are* experimental. So it's no surprise if they are broken.
-- daniel