Thanks Joshua. I would prefer that you post to the Mailing List / Newsgroup – so that all can benefit from your ideas.
--- On Sun, 8 Mar 2009, Joshua C. Lerner jlerner@gmail.com wrote:
From: Joshua C. Lerner jlerner@gmail.com Subject: Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Just for kicks I decided to try to do an import of ptwiki - using what I learned in bringing up mirrors of 4 Greek and 3 English Wikimedia sites, including Greek Wikipedia. Basically I had the best luck with Xml2sql (http://meta.wikimedia.org/wiki/Xml2sql). The conversion from XML to SQL went smoothly:
# ./xml2sql /mnt/pt/ptwiki-20090128-pages-articles.xml
As did the import:
# mysqlimport -u root -p --local pt ./{page,revision,text}.txt
Enter password:
pt.page: Records: 1044220 Deleted: 0 Skipped: 0 Warnings: 0
pt.revision: Records: 1044220 Deleted: 0 Skipped: 0 Warnings: 3
pt.text: Records: 1044220 Deleted: 0 Skipped: 0 Warnings: 0
I'm running maintenance/rebuildall.php at the moment:
# php rebuildall.php
** Rebuilding fulltext search index (if you abort this will break searching; run this script again to fix):
Dropping index...
Rebuilding index fields for 2119470 pages...
442500
(still running)
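Since this can take days, one option is to leave it running in the background, roughly like this (the log file name here is just an example, not what I actually used):

# run rebuildall.php detached from the terminal and log its output
nohup php maintenance/rebuildall.php > rebuildall.log 2>&1 &
# follow the progress
tail -f rebuildall.log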
I'll send a note to the list with the results of this experiment. Let me know if you need additional information or help. Are you trying to set up any mirrors?
Joshua
Thanks for making this attempt. Let me know if your rebuildall.php has memory issues.
This is really getting confusing for me – because there are so many ways – all of which guaranteed to work – that work, and the one that is recommended – does not seem to work.
I would try out your approach too – but it would take time as I only have one computer to spare.
Thanks, O.o.
On Sun, Mar 8, 2009 at 6:49 PM, O. Olson olson_ot@yahoo.com wrote:
Thanks Joshua. I would prefer that you post to the Mailing List / Newsgroup – so that all can benefit from your ideas.
Well, like I said, I was going to email the list eventually. ;-)
Thanks for making this attempt. Let me know if your rebuildall.php has memory issues.
Seems fine - steady at 2.2% of memory available.
This is really getting confusing for me – because there are so many ways – all of which guaranteed to work – that work, and the one that is recommended – does not seem to work.
I think you mean "all of which are *not* guaranteed to work".
I would try out your approach too – but it would take time as I only have one computer to spare.
If you want, I can just send you a database dump - either now or after rebuildall.php finishes. Right now it's refreshing the links table, but it's only up to page_id 34,100 out of over 2 million pages. It'll be running for days.
Joshua
Thanks Joshua. I am intending to try two approaches. The first is to use xml2sql and then fill in the rest of the tables from the individual SQL dumps of the tables that are already provided. The second is to use mwdumper and then import the rest of the tables from the same SQL dumps, to see if there are any differences.
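For the "fill in the rest of the tables" step, loading one of the provided per-table SQL dumps looks roughly like this (the file name is only an example for the 20081008 enwiki download):

# decompress one of the per-table dumps and feed it straight to MySQL
gunzip -c enwiki-20081008-categorylinks.sql.gz | mysql -u root -p wikidb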
Thanks for posting your experience with rebuildall.php. I think I might be able to live with the bad syntax that I get – if I cannot manage to get this to work. Thanks again, O. O.
I don't remember if I already mentioned this: you can split the xml file * into smaller pieces then import it using importDump.php.
Use a loop to make a file like this and then run it:

#!/bin/bash
php maintenance/importDump.php < /path/pagexml.1
wait
php maintenance/importDump.php < /path/pagexml.2
...
I haven't tried starting many php importDump.php processes working on different XML files simultaneously - would that work?
* = http://blog.prashanthellina.com/2007/10/17/ways-to-process-and-use-wikipedia...
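The loop can also just run the pieces directly instead of generating a script file - a rough sketch, assuming the pieces are named /path/pagexml.1, /path/pagexml.2, ...:

#!/bin/bash
# Rough sketch: feed each split piece to importDump.php in sequence.
# Note: the glob expands in lexical order, so zero-pad the numbers
# (pagexml.01, pagexml.02, ...) if the import order matters.
for f in /path/pagexml.*; do
    php maintenance/importDump.php < "$f"
done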
Thanks Mohamed – this is a good suggestion, but I am a bit wary of trying it, because if I have problems later, I would not be sure whether they are due to using this script to split the XML files.
I understand that the script looks OK, in that it simply splits the XML file at the “</page>” boundaries – but I don’t know much about how this would affect the final result.
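One way to sanity-check such a split would be to compare the number of <page> elements before and after - roughly like this (file names follow the examples above):

# Count <page> elements in the original dump and in the split pieces;
# the totals should match if the split only happened at page boundaries.
grep -c "<page>" enwiki-20081008-pages-articles.xml
cat /path/pagexml.* | grep -c "<page>"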
Thanks again,
O. O.
Hi, I hate to resurrect an old thread, but for the sake of completeness I would like to post my experience with importing the XML dumps of Wikipedia into MediaWiki, so that it may help someone else looking for this information. I started this thread after all.
I was attempting to import the XML/SQL dumps of the English Wikipedia http://download.wikimedia.org/enwiki/20081008/ (not the most recent version) using the three methods described at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps
I. Using importDump.php: While this is the recommended method, I ran into memory issues. PHP (CLI) runs out of memory after a day or two, and then you have to restart the import. (The good thing is that, after a restart, it skips quickly over pages it has already imported.) However, the fact that it crashed too many times made me give up on it.
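One thing worth trying - I cannot promise it helps, since the maintenance script may set its own limit - is raising PHP's CLI memory limit explicitly; the value below is just an example:

# Override PHP's CLI memory limit for the import; 1024M is only an example.
php -d memory_limit=1024M maintenance/importDump.php < enwiki-20081008-pages-articles.xml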
II. Using mwdumper: This is actually pretty fast and does not give errors. However, I could not figure out why it imports only 6.1 million pages, compared to the 7.6 million pages in the dump mentioned above (not the most recent dump). The command line output correctly indicates that 7.6 million pages have been processed – but when you count the entries in the page table, only 6.1 million show up. I don’t know what happens to the rest, because as far as I can see there were no errors.
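For reference, a typical mwdumper pipeline looks roughly like this (the jar name and database name are illustrative, not necessarily what was used here), and the count check afterwards is how the discrepancy shows up:

# typical mwdumper pipeline: convert the XML dump to SQL and pipe it into MySQL
java -jar mwdumper.jar --format=sql:1.5 enwiki-20081008-pages-articles.xml | mysql -u root -p wikidb
# compare the page count reported on the command line with what landed in the database
mysql -u root -p wikidb -e "SELECT COUNT(*) FROM page;"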
III. Using xml2sql: This is actually not the recommended way of importing the XML dumps according to http://meta.wikimedia.org/wiki/Xml2sql – but it is the only way that really worked for me. However, compared to the other tools, this one needs to be compiled/installed before it works. As Joshua suggested, a simple:

$ xml2sql enwiki-20081008-pages-articles.xml
$ mysqlimport -u root -p --local wikidb ./{page,revision,text}.txt

worked for me.
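For those who have not built it before, the build is a fairly standard configure/make affair - roughly something like this, where the tarball name and version are only examples:

# rough build sketch for xml2sql; adjust the tarball name to whatever you download
tar xzf xml2sql-0.5.tar.gz
cd xml2sql-0.5
./configure
make
make install    # may need root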
Notes: Your local MediaWiki will still not look like the online wiki (even after you take into account that images do not come with these dumps).
1. For that, I first imported the SQL dumps into the other tables that were available at http://download.wikimedia.org/enwiki/20081008/ (except page – since you have already imported it by now).
2. I next installed the extensions listed in the “Parser hooks” section under “Installed extensions” on http://en.wikipedia.org/wiki/Special:Version
3. Finally, I recommend that you use HTML Tidy, because even after the above steps the output is screwed up. The settings for HTML Tidy go in LocalSettings.php. They are not there by default; you need to copy them from includes/DefaultSettings.php. The settings that worked for me were:

$wgUseTidy = true;
$wgAlwaysUseTidy = false;
$wgTidyBin = '/usr/bin/tidy';
$wgTidyConf = $IP.'/includes/tidy.conf';
$wgTidyOpts = '';
$wgTidyInternal = extension_loaded( 'tidy' );
And
$wgValidateAllHtml = false;
Ensure this last one is false - else you would get nothing for most of the pages.
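It is also worth double-checking that $wgTidyBin actually points at a tidy binary on your system, for example:

# confirm that the binary referenced by $wgTidyBin exists and report its version
which tidy
tidy -v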
I hope the above information helps others who also want to import the XML dumps of Wikipedia into MediaWiki.
Thanks to all who answered my posts, O. O.