more import-related stuff - Xmldatadumps-l

18 Mar 2013

In my continued quest to Make Imports Suck Less (tm), I've written a
little Perl script that shovels data from a tab-delimited, escaped file
into a fifo in pieces, forking off mysql to LOAD DATA INFILE from the
fifo for each chunk.  It's only been tested on Linux, specifically on my
laptop, but I did run it on current article content dumps and all the
auxiliary tables for a wiki of a few hundred thousand articles, and it
worked OK.
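
For anyone who wants the general idea without reading the script, here
is a minimal Python sketch of the chunk-through-a-fifo approach described
above. It is not the Perl script itself. The names feed_in_chunks,
chunked and mysql_loader are made up for illustration, as are the
database and table arguments, and it assumes the mysql client is on the
PATH and that the MySQL server can read the fifo. Like the script, it
only makes sense on a system with named pipes.

```python
import os
import subprocess
import tempfile


def chunked(lines, chunk_size):
    """Yield lists of at most chunk_size lines."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def feed_in_chunks(lines, chunk_size, start_consumer):
    """Write lines to a fifo one chunk at a time.

    start_consumer(fifo_path) must start something that reads the fifo
    until EOF and return a callable that waits for it to finish.
    Returns the number of chunks written.
    """
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "chunk.fifo")
    os.mkfifo(fifo_path)
    count = 0
    try:
        for chunk in chunked(lines, chunk_size):
            wait = start_consumer(fifo_path)    # reader is started first
            # open() blocks until the reader has opened the fifo; if the
            # reader dies before that, this would hang (not handled here).
            with open(fifo_path, "w") as fifo:
                fifo.writelines(chunk)
            wait()                              # closing the fifo gave the reader EOF
            count += 1
    finally:
        os.remove(fifo_path)
        os.rmdir(workdir)
    return count


def mysql_loader(database, table):
    """Build a consumer that forks mysql for each chunk (hypothetical names).

    Relies on the LOAD DATA defaults: tab-separated fields, backslash
    escapes, newline-terminated rows. No quoting of names is attempted.
    """
    def start(fifo_path):
        sql = "LOAD DATA INFILE '%s' INTO TABLE %s" % (fifo_path, table)
        return subprocess.Popen(["mysql", database, "-e", sql]).wait
    return start
```

Usage would be something along the lines of
feed_in_chunks(open("page.tsv"), 10000, mysql_loader("wikidb", "page")),
where the file name, chunk size, database and table are placeholders.
Passing the consumer in as a function keeps the chunking logic testable
without a database.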

You can find it in the xmlfileutils directory of my branch of the git
dumps repo:
https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=blob;f=xmlfi…

You'll notice that all my tools are Linux + MySQL; that's because
that is what I use.  If folks want similar tools for other platforms
they'll have to write them, since I don't have the expertise for that.

Also, the docs on Meta about dumps have been reorganized and
rewritten.  They are neither error-free nor complete, but they
should be in much better shape now:
http://meta.wikimedia.org/wiki/Data_dumps

And lastly, the uncompressed en wp meta history dumps are now over 10 TB.
Yay?

As always, feedback, edits, patches welcome.

Ariel

P.S. Sorry Platonides, but if you were going to rework a script of yours
you were too slow ;-)  (However, if you have such a script with
different/better features I'll still take it.)