So partly due to recent work by folks like Kent on creating local WP
mirrors using the import process, and partly from helping walk someone
through the process for the zillionth time, I have come to the
realization that This Process Sucks (TM). I am not taking on the whole
stack, but I am trying to make a dent in at least part of it. To that
end:
1) The mwdumper available from download.wikimedia.org is now the
current version and should run without a bunch of fancy tricks. Thanks
to Chad for fixing up the jenkins build. I tried it on a recent en
wikipedia current pages dump and it seemed to work, though I did not
test importing the output.
2) I have a couple of tools for *nix users importing into a MySQL
database:
* Somewhat equivalent to mwdumper is 'mwxml2sql', a name chosen before I
saw that there was a long-abandoned xml2sql tool available in the wild.
Input: stubs and page content XML files. Output: SQL files for each of
the page, revision, and text tables. It reads 0.4 xsd through 0.7 and
writes MW 1.5 through 1.20 output, as specified by the user. Many
specific combinations are untested (e.g. I spent most work on 0.7 xsd to MW
* Converting an SQL dump file to a tab-delimited format suitable for
'LOAD DATA INFILE' is now possible via 'sql2txt' (also *nix platforms).
I tested these on a smallish wiki dump in a non-Latin character set; a
test on en wikipedia is in the works, but loading all those other
tables, even via LOAD DATA INFILE, takes some time.
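
To give a feel for the kind of conversion sql2txt does, here is a rough
Python sketch. It is not the sql2txt source, and the function names
(parse_rows, to_tsv_field, insert_to_tsv) are made up for illustration.
It assumes mysqldump-style extended INSERT statements with single-quoted
strings and backslash escapes, and it writes the default LOAD DATA
INFILE format (tab between fields, newline between rows, \N for NULL).

```python
# Subset of the backslash escapes used inside quoted strings in a MySQL dump.
UNESCAPE = {"n": "\n", "t": "\t", "r": "\r", "0": "\0",
            "\\": "\\", "'": "'", '"': '"'}


def parse_rows(stmt):
    """Return a list of rows (lists of str or None) from one INSERT statement."""
    i = stmt.index("VALUES") + len("VALUES")
    rows, row, buf = [], [], []
    in_row = in_str = was_quoted = False
    while i < len(stmt):
        ch = stmt[i]
        if in_str:
            if ch == "\\":                      # backslash escape: decode next char
                nxt = stmt[i + 1]
                buf.append(UNESCAPE.get(nxt, nxt))
                i += 1
            elif ch == "'":
                if stmt[i + 1:i + 2] == "'":    # doubled quote means a literal quote
                    buf.append("'")
                    i += 1
                else:
                    in_str = False
            else:
                buf.append(ch)
        elif ch == "'":
            in_str = was_quoted = True
        elif ch == "(" and not in_row:
            in_row = True
        elif ch in ",)" and in_row:             # end of a field (and maybe of the row)
            text = "".join(buf)
            row.append(None if text == "NULL" and not was_quoted else text)
            buf, was_quoted = [], False
            if ch == ")":
                rows.append(row)
                row, in_row = [], False
        elif in_row and not ch.isspace():       # unquoted value such as a number
            buf.append(ch)
        i += 1
    return rows


def to_tsv_field(value):
    """Escape one field for LOAD DATA INFILE with default settings."""
    if value is None:
        return "\\N"
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n"))


def insert_to_tsv(stmt):
    """Convert one INSERT statement into tab-delimited lines."""
    return "".join(
        "\t".join(to_tsv_field(v) for v in row) + "\n"
        for row in parse_rows(stmt)
    )


if __name__ == "__main__":
    sample = "INSERT INTO `page` VALUES (1,0,'Main_Page',NULL),(2,0,'It\\'s a\\ttab','x');"
    print(insert_to_tsv(sample), end="")
```

Output written this way to, say, page.txt could then be loaded with
something along the lines of LOAD DATA INFILE 'page.txt' INTO TABLE page;
the file name here is just an example. A real converter also has to cope
with binary fields and very long statements, which this sketch ignores.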
Link to source:
So what I would love from folks is:
Test, find bugs, ask for features, tell me where other pain points are
in the import process. If you find bugs/want features and write a
patch, and you have a gerrit account, feel free to submit a changeset
right there and add me as a reviewer. If you have a patch and don't have
an account, get one :-)
Once I know these are actually useful, I will try to make a dent in the
pages on Meta and elsewhere that describe, sometimes referring to
information several years old, how to import the dumps. Ah yeah, and I'll
put up static binaries for Linux/FreeBSD then too.