O. O. wrote:
Hi, This may be a bit obvious – but I don’t have quite as much experience in this area. The SQL Dumps provided at http://download.wikimedia.org do not contain specifications for the “DEFAULT CHARSET” of the respective Table. When installing MediaWiki – it seems to be recommended to use the binary Charset. I would like to know how to import one of these dumps into a Table with the binary Charset.
Right now I import on the cmdline: E.g.
mysql wikidb < enwiki-20090306-pagelinks.sql
This results in the corresponding Table being dropped and then recreated again. The problem with this is that the newly created Table does not have the “DEFAULT CHARSET” set to Binary, because the SQL Dumps do not have these specified.
I first attempted to modify my my.cnf file to set the “DEFAULT CHARSET” to binary for new Tables. I attempted to make the following changes to my.cnf:
[client] default-character-set=binary [mysqld] default-character-set=binary default-collation=binary character-set-server=binary collation-server=binary init-connect='SET NAMES binary'
I restarted the Server – but I found that the new Table that gets created, gets created in UTF-8, not binary.
I then attempted to edit the SQL File i.e. replace the line
) TYPE=InnoDB;
With
) TYPE=InnoDB DEFAULT CHARSET=binary;
This works, in the sense that now the new Table gets created in Binary. However I think I am making mistakes in editing the file. These files are rather large, so I wrote code in Perl, and again in Java to do the editing. They can manage to do the above substitution, but I am not entirely confident about their UTF-8 handling.
You can also use sed to edit it: $ sed -i "n;n;n;n;n;n;n;n;n;n;n;n;n;n;n;n;n;s/InnoDB/InnoDB DEFAULT CHARSET=binary/" enwiki-20090306-pagelinks.sql
will modify just that line (the 18th, adjust the number of 'n;'s should the schema change) to be ) TYPE=InnoDB DEFAULT CHARSET=binary;
The problem appears when I am trying to import these modified files, where I get an error “Duplicate entry” e.g. for the enwiki-20090306-pagelinks.sql file, I get the error:
ERROR 1062 (23000) at line 1359: Duplicate entry '1198132-2-Gangleri/tests/links/�' for key 1
I would like to add that importing this file as UTF-8 results in this “Duplicate entry” error coming much earlier in the input file.
I have looked at the entries from 1198132 (http://en.wikipedia.org/wiki/User:%D7%9C%D7%A2%D7%A8%D7%99_%D7%A8%D7%99%D7%9...) on that file and they aren't duplicated (the file is ok), but they stress the charset a bit (uses the full 0-255 range) so if mysql is not completely interpreting it as binary, it'll have problems.
It explains that in utf8 you get the problem earlier, but I don't know why exactly it's failing in your setup.
You can also try luck with yesterday's pagelinks.sql
So, what’s the correct way of importing these SQL Dumps, such that they are imported into a Table in Binary? If my above description is not clear please let me know and I would try to explain again.
Thanks a lot, O. O.
P. S. I am running MediaWiki/MySQL under Ubuntu. I hope UTF-8 is handled correctly on the Commandline Bash – but I don’t know how to check that.