So partly due to recent work by folks like Kent on creating local WP mirrors using the import process, and partly from helping walk someone through the process for the zillionth time, I have come to the realization that This Process Sucks (TM). I am not taking on the whole stack, but I am trying to make a dent in at least part of it. To that end:
1) The mwdumper available from download.wikimedia.org is now the current version and should run without a bunch of fancy tricks. Thanks to Chad for fixing up the jenkins build. I tried it on a recent en wikipedia current pages dump and it seemed to work, though I did not test importing the output.
2) I have a couple of tools for *nix users importing into a MySQL database.
* Somewhat equivalent to mwdumper is 'mwxml2sql', name chosen before I saw that there was a long-abandoned xml2sql tool available in the wild. Input: stubs and page content xml files; output: sql files for each of the page, revision, and text tables. It reads 0.4 through 0.7 xsd and writes MW 1.5 through 1.20 output, as specified by the user. Many specific combinations are untested (e.g. I spent most work on 0.7 xsd to MW 1.20).
* Converting an sql dump file to a tab delimited format suitable for 'LOAD DATA INFILE' is now possible via 'sql2txt' (also *nix platforms).
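For anyone who wants a feel for what the XML side involves, here is a rough Python sketch of streaming pages and revisions out of an export file. It is not the mwxml2sql code and makes no attempt to match its output; the function names (local, iter_page_rows) are made up for illustration, and it assumes the usual page/title/ns/id/revision/text layout, with ns treated as optional.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_page_rows(xml_file):
    """Yield (page_id, ns, title, rev_id, text) for every revision.

    Hypothetical helper: works on a path or a binary file object and
    ignores the schema version in the xmlns attribute.
    """
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {local(child.tag): child for child in elem}
        title = fields["title"].text
        page_id = int(fields["id"].text)
        ns_elem = fields.get("ns")  # assumed optional here
        ns = int(ns_elem.text) if ns_elem is not None else None
        for rev in elem:
            if local(rev.tag) != "revision":
                continue
            rev_fields = {local(child.tag): child for child in rev}
            rev_id = int(rev_fields["id"].text)
            text = rev_fields["text"].text or ""
            yield (page_id, ns, title, rev_id, text)
        # Free the finished page so memory stays flat on big dumps.
        elem.clear()
```

Each tuple carries roughly what the page, revision, and text tables need; turning that into version-specific sql is the part the real tool handles.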
I tested these on a smallish non-latin-character set wiki dump; a test on en wikipedia is in the works but loading all those other tables, even via LOAD DATA INFILE, takes some time.
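To make the sql-to-tab-delimited idea concrete, here is a minimal Python sketch of the conversion. It is not how sql2txt is implemented, just an illustration under some assumptions: the input is a single mysqldump-style "INSERT INTO ... VALUES (...),(...);" statement, the function names (parse_insert_values, to_load_data_line) are hypothetical, and the output uses the default LOAD DATA INFILE conventions (tab between fields, backslash escapes, \N for NULL).

```python
# Backslash escapes assumed to appear inside dumped string literals.
_SQL_UNESCAPE = {"0": "\0", "n": "\n", "r": "\r", "t": "\t", "Z": "\x1a"}


def parse_insert_values(statement):
    """Yield one list of fields per (...) tuple; SQL NULL becomes None."""
    i = statement.index("VALUES") + len("VALUES")
    n = len(statement)
    while i < n:
        if statement[i] != "(":
            i += 1
            continue
        i += 1  # step past "("
        row = []
        while True:
            while statement[i].isspace():
                i += 1
            if statement[i] == "'":
                # Quoted string: undo backslash escapes and doubled quotes.
                i += 1
                buf = []
                while True:
                    ch = statement[i]
                    if ch == "\\":
                        nxt = statement[i + 1]
                        buf.append(_SQL_UNESCAPE.get(nxt, nxt))
                        i += 2
                    elif ch == "'":
                        if i + 1 < n and statement[i + 1] == "'":
                            buf.append("'")
                            i += 2
                        else:
                            i += 1
                            break
                    else:
                        buf.append(ch)
                        i += 1
                row.append("".join(buf))
                while statement[i].isspace():
                    i += 1
            else:
                # Unquoted token: a number or NULL.
                j = i
                while statement[j] not in ",)":
                    j += 1
                token = statement[i:j].strip()
                row.append(None if token.upper() == "NULL" else token)
                i = j
            if statement[i] == ")":
                i += 1
                break
            i += 1  # step past ","
        yield row


def to_load_data_line(row):
    """Format one row for LOAD DATA INFILE with default field options."""
    out = []
    for field in row:
        if field is None:
            out.append("\\N")
        else:
            out.append(
                field.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
            )
    return "\t".join(out)
```

A real converter also has to cope with multi-gigabyte files, binary columns, and character set issues, none of which this sketch attempts.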
Link to source: https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=xmlfil...
So what I would love from folks is:
Test, find bugs, ask for features, tell me where other pain points are in the import process. If you find bugs/want features and write a patch, and you have a gerrit account, feel free to submit a changeset right there and add me as a reviewer. If you have a patch and don't have an account, get one :-)
Once I know these are actually useful, I will try to make a dent in the pages on Meta and elsewhere that describe how to import the dumps, sometimes referring to information several years old. Ah yeah, and I'll put up static binaries for linux/freebsd then too.
Thanks!
Ariel