Hi,
I hate to resurrect an old thread, but for completeness I would like to
post my experience with importing the XML dumps of Wikipedia into
MediaWiki, so that it may help someone else looking for this
information. I started this thread, after all.
I was attempting to import the XML/SQL dumps of the English Wikipedia
http://download.wikimedia.org/enwiki/20081008/ (not the most recent
version) using the three methods described at
http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps
I. Using importDump.php:
While this is the recommended method, I ran into memory issues. The
PHP CLI runs out of memory after a day or two, and then you have to
restart the import. (The good thing is that, after a restart, it
quickly skips over pages it has already imported.) However, it crashed
so many times that I eventually gave up on it.
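In hindsight, one workaround (which I did not try, so treat this as a
sketch with hypothetical file handling) would be to split the dump into
chunks of complete pages and feed them to importDump.php one at a time,
so a crash only costs the current chunk:

```python
def split_dump(lines, open_chunk, pages_per_chunk=100_000):
    """Copy complete <page>...</page> blocks from `lines` (an iterable
    of XML lines, as MediaWiki dumps put each tag on its own line),
    `pages_per_chunk` per chunk. `open_chunk(i)` must return a writable
    file object for chunk i. Each chunk is wrapped in a minimal
    <mediawiki> root so it is well-formed XML on its own."""
    out, idx, pages, in_page = None, 0, 0, False
    for line in lines:
        if "<page>" in line:
            in_page = True
            if out is None:
                out = open_chunk(idx)
                out.write("<mediawiki>\n")
        if in_page:
            out.write(line)
        if "</page>" in line:
            in_page = False
            pages += 1
            if pages == pages_per_chunk:
                out.write("</mediawiki>\n")
                out, idx, pages = None, idx + 1, 0
    if out is not None:
        out.write("</mediawiki>\n")
```

Note that this drops the <siteinfo> header from the chunks; I have not
verified that importDump.php accepts a dump without it, so test on a
small chunk first.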
II. Using mwdumper:
This is actually pretty fast and does not give errors. However, I could
not figure out why it imports only 6.1 million pages, compared to the
7.6 million pages in the dump mentioned above (not the most recent
dump). The command-line output correctly indicates that 7.6M pages have
been processed – but counting the entries in the page table shows only
6.1M. I don’t know what happened to the rest, because as far as I can
see there were no errors.
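One way to narrow down where the missing ~1.5M pages go (a sketch, not
something I actually ran) would be to count the <page> elements in the
dump stream directly and compare that with SELECT COUNT(*) FROM page:

```python
import xml.etree.ElementTree as ET

def count_pages(xml_stream):
    """Stream-parse a MediaWiki dump and count <page> elements without
    loading the whole file, so it works on the multi-GB enwiki dumps.
    The dump uses a namespaced root, so match on the local tag name."""
    n = 0
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "page":
            n += 1
            elem.clear()  # free memory for the subtree just counted
    return n
```

If this number matches the 7.6M that mwdumper reports, the rows are
being dropped on the MySQL side (e.g. duplicate keys or encoding
errors) rather than by the parser.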
III. Using xml2sql:
Actually this is not the recommended way of importing the XML dumps
according to
http://meta.wikimedia.org/wiki/Xml2sql - but it is the only
way that really worked for me. However, unlike the other tools, it
needs to be compiled and installed first. As Joshua suggested, a
simple:
$ xml2sql enwiki-20081008-pages-articles.xml
$ mysqlimport -u root -p --local wikidb ./{page,revision,text}.txt
worked for me.
Notes: Your local MediaWiki will still not look like the online wiki
(even after you take into account that images do not come with these
dumps).
1. For that, I first imported the SQL dumps into the other tables that
were available at
http://download.wikimedia.org/enwiki/20081008/ (except
page, since you have already imported it by now.)
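These imports are just gunzip-to-mysql pipelines, one per table. As a
rough sketch (the table list, file names, and credentials below are
examples; use whatever .sql.gz files are actually in the dump
directory):

```python
import subprocess

# Example auxiliary tables from the dump directory; adjust to the
# .sql.gz files you actually downloaded. "page" is deliberately left
# out because the XML import already filled it.
TABLES = ["category", "categorylinks", "externallinks",
          "imagelinks", "interwiki", "langlinks",
          "pagelinks", "redirect", "templatelinks"]

def import_pipeline(table, db="wikidb", user="root",
                    prefix="enwiki-20081008"):
    """Build the shell pipeline that feeds one compressed dump to mysql."""
    return f"gunzip -c {prefix}-{table}.sql.gz | mysql -u {user} -p {db}"

def run_all(tables=TABLES):
    for t in tables:
        # shell=True because the command is a pipeline
        subprocess.run(import_pipeline(t), shell=True, check=True)
```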
2. I next installed the extensions listed in the “Parser hooks” section
under “Installed extensions” on
http://en.wikipedia.org/wiki/Special:Version
3. Finally, I recommend that you use HTML Tidy, because even after the
above steps the output is screwed up. The settings for HTML Tidy go in
LocalSettings.php. They are not there by default; you need to copy them
from includes/DefaultSettings.php. The settings that worked for me
were:
$wgUseTidy = true;
$wgAlwaysUseTidy = false;
$wgTidyBin = '/usr/bin/tidy';
$wgTidyConf = $IP.'/includes/tidy.conf';
$wgTidyOpts = '';
$wgTidyInternal = extension_loaded( 'tidy' );
And
$wgValidateAllHtml = false;
Ensure this last one is false, otherwise you will get blank output for
most of the pages.
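For reference, MediaWiki ships its own includes/tidy.conf (which
$wgTidyConf above points at), so check that first; but a minimal Tidy
configuration along these lines should also work (an illustrative
sketch, not the exact shipped file):

```
# Illustrative tidy.conf sketch
show-body-only: yes
force-output: yes
tidy-mark: no
wrap: 0
quiet: yes
```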
I hope the above information helps others who also want to import the
XML dumps of Wikipedia into MediaWiki.
Thanks to all who answered my posts,
O. O.