Hi, I hate to resurrect an old thread, but for the sake of completeness I would like to post my experience importing the XML dumps of Wikipedia into MediaWiki, so that it can help anyone else looking for this information. I started this thread, after all.
I was attempting to import the XML/SQL dumps of the English Wikipedia from http://download.wikimedia.org/enwiki/20081008/ (not the most recent version) using the three methods described at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps
I. Using importDump.php: While this is the recommended method, I ran into memory issues. The PHP CLI runs out of memory after a day or two, and then you have to restart the import. (The good thing is that, after a restart, it skips quickly over pages it has already imported.) However, it crashed so many times that I eventually gave up on it.
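For anyone retrying this route, the invocation is roughly the following. (A sketch - the memory_limit override and the 512M value are my guesses at a workaround, and raising the limit may only postpone the crash. importDump.php reads the dump from stdin.)

$ cd /path/to/mediawiki
$ php -d memory_limit=512M maintenance/importDump.php < enwiki-20081008-pages-articles.xml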
II. Using mwdumper: This is actually pretty fast, and does not give errors. However, I could not figure out why this imports only 6.1 million pages, as compared to 7.6 million pages in the dump mentioned above (not the most recent dump). The command line output correctly indicates that 7.6M pages have been processed - but when you count the entries in the page table, only 6.1M show up. I don't know what happens to the rest, because as far as I can see there were no errors.
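For reference, a typical invocation (per the mwdumper documentation - adjust the jar path, database name, and credentials to your setup; mwdumper can read the .bz2 directly):

$ java -jar mwdumper.jar --format=sql:1.5 enwiki-20081008-pages-articles.xml.bz2 | mysql -u root -p wikidb

The 6.1M figure above is simply the row count of the page table after the import finished:

mysql> SELECT COUNT(*) FROM page;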
III. Using xml2sql: According to http://meta.wikimedia.org/wiki/Xml2sql this is actually not the recommended way of importing the XML dumps - but it is the only way that really worked for me. Unlike the other tools, however, it needs to be compiled/installed before you can use it (a sketch of the build steps follows below). As Joshua suggested, a simple

$ xml2sql enwiki-20081008-pages-articles.xml
$ mysqlimport -u root -p --local wikidb ./{page,revision,text}.txt

worked for me.
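Since the compile step trips people up, here is a sketch of the build (assuming the source unpacks with the usual configure script, and that libexpat plus its development headers are already installed):

$ cd xml2sql
$ ./configure
$ make
$ sudo make install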
Notes: Your local MediaWiki will still not look like the online wiki (even after you take into account that images do not come with these dumps). To get closer:

1. First, import the SQL dumps into the other tables that were available at http://download.wikimedia.org/enwiki/20081008/ (except page - you have already imported that by now). A sketch of the import commands follows after these notes.

2. Next, install the extensions listed in the "Parser hooks" section under "Installed extensions" on http://en.wikipedia.org/wiki/Special:Version

3. Finally, I recommend that you use HTML Tidy, because even after the above steps the output is screwed up. The settings for HTML Tidy go in LocalSettings.php. They are not there by default; you need to copy them from includes/DefaultSettings.php. The settings that worked for me were:

$wgUseTidy = true;
$wgAlwaysUseTidy = false;
$wgTidyBin = '/usr/bin/tidy';
$wgTidyConf = $IP.'/includes/tidy.conf';
$wgTidyOpts = '';
$wgTidyInternal = extension_loaded( 'tidy' );
And
$wgValidateAllHtml = false;
Ensure this last one is false - otherwise you will get nothing for most of the pages.
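For step 1 above, importing the per-table SQL dumps is just a matter of feeding each file to mysql. A sketch, assuming your database is called wikidb and the gzipped dumps sit in the current directory (repeat for each table you downloaded, e.g. categorylinks, pagelinks, redirect):

$ zcat enwiki-20081008-categorylinks.sql.gz | mysql -u root -p wikidb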
I hope the above information helps others who also want to import the XML dumps of Wikipedia into MediaWiki.
Thanks to all who answered my posts, O. O.