Correct way to Import SQL Dumps of Wikipedia into MediaWiki in Binary

List overview All Threads
Download

newer

older

PHP 5.2.10RC1 Testing

Darwinian search on wikipedia

O. O.

20 May 2009 20 May '09

1:18 a.m.

Hi, This may be a bit obvious – but I don’t have quite as much experience in this area. The SQL Dumps provided at http://download.wikimedia.org do not contain specifications for the “DEFAULT CHARSET” of the respective Table. When installing MediaWiki – it seems to be recommended to use the binary Charset. I would like to know how to import one of these dumps into a Table with the binary Charset.

Right now I import on the cmdline: E.g.

mysql wikidb < enwiki-20090306-pagelinks.sql

This results in the corresponding Table being dropped and then recreated again. The problem with this is that the newly created Table does not have the “DEFAULT CHARSET” set to Binary, because the SQL Dumps do not have these specified.

I first attempted to modify my my.cnf file to set the “DEFAULT CHARSET” to binary for new Tables. I attempted to make the following changes to my.cnf:

[client] default-character-set=binary [mysqld] default-character-set=binary default-collation=binary character-set-server=binary collation-server=binary init-connect='SET NAMES binary'

I restarted the Server – but I found that the new Table that gets created, gets created in UTF-8, not binary.

I then attempted to edit the SQL File i.e. replace the line

) TYPE=InnoDB;

With

) TYPE=InnoDB DEFAULT CHARSET=binary;

This works, in the sense that now the new Table gets created in Binary. However I think I am making mistakes in editing the file. These files are rather large, so I wrote code in Perl, and again in Java to do the editing. They can manage to do the above substitution, but I am not entirely confident about their UTF-8 handling. The problem appears when I am trying to import these modified files, where I get an error “Duplicate entry” e.g. for the enwiki-20090306-pagelinks.sql file, I get the error:

ERROR 1062 (23000) at line 1359: Duplicate entry '1198132-2-Gangleri/tests/links/ï¿½' for key 1

I would like to add that importing this file as UTF-8 results in this “Duplicate entry” error coming much earlier in the input file.

So, what’s the correct way of importing these SQL Dumps, such that they are imported into a Table in Binary? If my above description is not clear please let me know and I would try to explain again.

Thanks a lot, O. O.

P. S. I am running MediaWiki/MySQL under Ubuntu. I hope UTF-8 is handled correctly on the Commandline Bash – but I don’t know how to check that.

Show replies by date

Platonides

22 May 22 May

5:28 a.m.

New subject: Correct way to Import SQL Dumps of Wikipedia into MediaWiki in Binary

O. O. wrote:

...

Hi, This may be a bit obvious – but I don’t have quite as much experience in this area. The SQL Dumps provided at http://download.wikimedia.org do not contain specifications for the “DEFAULT CHARSET” of the respective Table. When installing MediaWiki – it seems to be recommended to use the binary Charset. I would like to know how to import one of these dumps into a Table with the binary Charset.

Right now I import on the cmdline: E.g.

mysql wikidb < enwiki-20090306-pagelinks.sql

This results in the corresponding Table being dropped and then recreated again. The problem with this is that the newly created Table does not have the “DEFAULT CHARSET” set to Binary, because the SQL Dumps do not have these specified.

I first attempted to modify my my.cnf file to set the “DEFAULT CHARSET” to binary for new Tables. I attempted to make the following changes to my.cnf:

[client] default-character-set=binary [mysqld] default-character-set=binary default-collation=binary character-set-server=binary collation-server=binary init-connect='SET NAMES binary'

I restarted the Server – but I found that the new Table that gets created, gets created in UTF-8, not binary.

I then attempted to edit the SQL File i.e. replace the line

) TYPE=InnoDB;

With

) TYPE=InnoDB DEFAULT CHARSET=binary;

This works, in the sense that now the new Table gets created in Binary. However I think I am making mistakes in editing the file. These files are rather large, so I wrote code in Perl, and again in Java to do the editing. They can manage to do the above substitution, but I am not entirely confident about their UTF-8 handling.

You can also use sed to edit it: $ sed -i "n;n;n;n;n;n;n;n;n;n;n;n;n;n;n;n;n;s/InnoDB/InnoDB DEFAULT CHARSET=binary/" enwiki-20090306-pagelinks.sql

will modify just that line (the 18th, adjust the number of 'n;'s should the schema change) to be ) TYPE=InnoDB DEFAULT CHARSET=binary;

...

The problem appears when I am trying to import these modified files, where I get an error “Duplicate entry” e.g. for the enwiki-20090306-pagelinks.sql file, I get the error:

ERROR 1062 (23000) at line 1359: Duplicate entry '1198132-2-Gangleri/tests/links/ï¿½' for key 1

I would like to add that importing this file as UTF-8 results in this “Duplicate entry” error coming much earlier in the input file.

I have looked at the entries from 1198132 (http://en.wikipedia.org/wiki/User:%D7%9C%D7%A2%D7%A8%D7%99_%D7%A8%D7%99%D7%9...) on that file and they aren't duplicated (the file is ok), but they stress the charset a bit (uses the full 0-255 range) so if mysql is not completely interpreting it as binary, it'll have problems.

It explains that in utf8 you get the problem earlier, but I don't know why exactly it's failing in your setup.

You can also try luck with yesterday's pagelinks.sql

...

So, what’s the correct way of importing these SQL Dumps, such that they are imported into a Table in Binary? If my above description is not clear please let me know and I would try to explain again.

Thanks a lot, O. O.

P. S. I am running MediaWiki/MySQL under Ubuntu. I hope UTF-8 is handled correctly on the Commandline Bash – but I don’t know how to check that.

O. O.

23 May 23 May

3:39 a.m.

New subject: Correct way to Import SQL Dumps of Wikipedia into MediaWiki in Binary

Platonides wrote:

...

O. O. wrote:

...
I then attempted to edit the SQL File i.e. replace the line

) TYPE=InnoDB;

With

) TYPE=InnoDB DEFAULT CHARSET=binary;

This works, in the sense that now the new Table gets created in Binary. However I think I am making mistakes in editing the file. These files are rather large, so I wrote code in Perl, and again in Java to do the editing. They can manage to do the above substitution, but I am not entirely confident about their UTF-8 handling.

You can also use sed to edit it: $ sed -i "n;n;n;n;n;n;n;n;n;n;n;n;n;n;n;n;n;s/InnoDB/InnoDB DEFAULT CHARSET=binary/" enwiki-20090306-pagelinks.sql

Thanks for your reply Platonides. I am trying your suggestion right now. It would take a few hours to crash – if it does. (I hope sed handles UTF-8 correctly.) I would try yesterdays pagelinks.sql later.

So assume if I do make the change as you suggested above i.e. specifically set the “DEFAULT CHARSET” to binary, would there be any problems importing using

...

$ mysql wikidb < enwiki-20090306-pagelinks.sql

I am using Linux (Ubuntu). My question is if the Shell which does the Pipe – would it have any effect of modifying the characters before mysql gets them. Right now I think the Shell supports UTF-8 – but I hope it is not messing things up.

Thanks a lot. O.O.

Aryeh Gregor

3:50 a.m.

On Fri, May 22, 2009 at 6:09 PM, O. O. olson_ot@yahoo.com wrote:

...

Thanks for your reply Platonides. I am trying your suggestion right now. It would take a few hours to crash – if it does. (I hope sed handles UTF-8 correctly.) I would try yesterdays pagelinks.sql later.

sed treats UTF-8 as a stream of bytes. Since the pattern won't match UTF-8 (UTF-8 only contains ASCII bytes if they represent ASCII code points), it will just ignore those bytes.

(That sed pattern is pretty horrifying and fragile, though. I'd recommend something more like: sed -i 's/^) TYPE=InnoDB;$/) TYPE=InnoDB DEFAULT CHARSET=binary;/' )

...

>$ mysql wikidb < enwiki-20090306-pagelinks.sql

I am using Linux (Ubuntu). My question is if the Shell which does the Pipe – would it have any effect of modifying the characters before mysql gets them. Right now I think the Shell supports UTF-8 – but I hope it is not messing things up.

The shell is only handing the mysql command a file descriptor. mysql will read the file itself directly, the shell won't touch any of the input.

O. O.

4:11 a.m.

New subject: Correct way to Import SQL Dumps of Wikipedia into MediaWiki in Binary

Aryeh Gregor wrote:

...

On Fri, May 22, 2009 at 6:09 PM, O. O. olson_ot@yahoo.com wrote:

(That sed pattern is pretty horrifying and fragile, though. I'd recommend something more like: sed -i 's/^) TYPE=InnoDB;$/) TYPE=InnoDB DEFAULT CHARSET=binary;/' )

...

The shell is only handing the mysql command a file descriptor. mysql will read the file itself directly, the shell won't touch any of the input.

Thanks Aryeh for this clarification. I was really feeling that shell was doing something – because I trusted the Wikipedia SQL dump. Thanks also for your sed command – it is more general i.e. works for other files where the line to be modified is not on line number 18. O.O.

Platonides

4:40 a.m.

New subject: Correct way to Import SQL Dumps of Wikipedia into MediaWiki in Binary

Aryeh Gregor wrote:

...

On Fri, May 22, 2009 at 6:09 PM, O. O. olson_ot@yahoo.com wrote:

...
Thanks for your reply Platonides. I am trying your suggestion right now. It would take a few hours to crash – if it does. (I hope sed handles UTF-8 correctly.) I would try yesterdays pagelinks.sql later.

sed treats UTF-8 as a stream of bytes. Since the pattern won't match UTF-8 (UTF-8 only contains ASCII bytes if they represent ASCII code points), it will just ignore those bytes.

(That sed pattern is pretty horrifying and fragile, though. I'd recommend something more like: sed -i 's/^) TYPE=InnoDB;$/) TYPE=InnoDB DEFAULT CHARSET=binary;/' )

I wanted to be sure that the command provided to olson only modified that line. The content below that may be everything. Not that I think your sed will fail in this case, given the allowed titles, but it *can* fail on some of the other sql dumps. And yes, the syntax is horrible. There's a way to do it providing a number instead of repeating the "n;", but it's a gnu extension.

O. O.

24 May 24 May

6:56 a.m.

New subject: Correct way to Import SQL Dumps of Wikipedia into MediaWiki in Binary

Platonides wrote:

...

Aryeh Gregor wrote:

...
On Fri, May 22, 2009 at 6:09 PM, O. O. olson_ot@yahoo.com wrote:

...
Thanks for your reply Platonides. I am trying your suggestion right now. It would take a few hours to crash – if it does. (I hope sed handles UTF-8 correctly.) I would try yesterdays pagelinks.sql later.

sed treats UTF-8 as a stream of bytes. Since the pattern won't match UTF-8 (UTF-8 only contains ASCII bytes if they represent ASCII code points), it will just ignore those bytes.

(That sed pattern is pretty horrifying and fragile, though. I'd recommend something more like: sed -i 's/^) TYPE=InnoDB;$/) TYPE=InnoDB DEFAULT CHARSET=binary;/' )

I wanted to be sure that the command provided to olson only modified that line. The content below that may be everything. Not that I think your sed will fail in this case, given the allowed titles, but it *can* fail on some of the other sql dumps. And yes, the syntax is horrible. There's a way to do it providing a number instead of repeating the "n;", but it's a gnu extension.

Thanks Platonides for the clarification on your syntax. Even if your syntax is bad, I can get another script (which does not modify the SQL file) to generate it.

Also, I think that your suggestion is working, because while my import has not yet completed, from the mysql> show processlist; it seems that it has crossed the point where it used to crash.

Keeping my fingers crossed till it completes, O. O.

O. O.

27 May 27 May

5:27 a.m.

New subject: Correct way to Import SQL Dumps of Wikipedia into MediaWiki in Binary

Thanks Platonides. Your suggestion worked. It imported into SQL without reporting any Duplicate entries.

O. O.

Aryeh Gregor

24 May 24 May

8:26 p.m.

On Fri, May 22, 2009 at 7:10 PM, Platonides Platonides@gmail.com wrote:

...

I wanted to be sure that the command provided to olson only modified that line. The content below that may be everything. Not that I think your sed will fail in this case, given the allowed titles, but it *can* fail on some of the other sql dumps.

Can it? Aren't newlines in the dumps encoded as \n, with real newlines only occurring as part of SQL statements rather than string literals? Note the ^ and $.

5667

Age (days ago)

5674

Last active (days ago)

wikitech-l@lists.wikimedia.org

8 comments

3 participants

tags (0)

participants (3)

Aryeh Gregor
O. O.
Platonides