Hi,
I am not sure if this is the correct place to ask this – if not, please let me know which is the best place for such a question.
Does anyone have experience importing the Wikipedia XML dumps into MediaWiki? I made an attempt with the English wiki dump as well as the Portuguese wiki dump, giving PHP (CLI) 1024 MB of memory in the php.ini file. Both of these attempts fail with out-of-memory errors.
I am using the latest version of MediaWiki, 1.14.0, and PHP 5.2.6-1+lenny2 with Suhosin-Patch 0.9.6.2 (cli) (built: Jan 26 2009 22:41:04).
Does anyone have experience with this import and how to avoid the memory errors? I can give it more memory – but it seems to be leaking memory over time.
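As a sketch, one way to raise the limit per run instead of editing php.ini looks like this (the dump filename and the 1200M value below are examples, not figures from this thread):

```shell
# Example only: filename and limit are placeholders; adjust to your setup.
DUMP=enwiki-latest-pages-articles.xml.bz2
LIMIT=1200M

# php -d overrides a php.ini setting for a single CLI invocation,
# so the limit can be raised without touching the global config:
CMD="bzcat $DUMP | php -d memory_limit=$LIMIT maintenance/importDump.php"
echo "$CMD"
# From the MediaWiki root, run it for real with:  eval "$CMD"
```

Note this only postpones the problem if memory usage genuinely grows without bound over the import.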
Thanks again, O. O.
O. Olson wrote:
Hi,
I am not sure if this is the correct place to ask this – if not, please let me know which is the best place for such a question. Does anyone have experience importing the Wikipedia XML dumps into MediaWiki? I made an attempt with the English wiki dump as well as the Portuguese wiki dump, giving PHP (CLI) 1024 MB of memory in the php.ini file. Both of these attempts fail with out-of-memory errors. I am using the latest version of MediaWiki, 1.14.0, and PHP 5.2.6-1+lenny2 with Suhosin-Patch 0.9.6.2 (cli) (built: Jan 26 2009 22:41:04). Does anyone have experience with this import and how to avoid the memory errors? I can give it more memory – but it seems to be leaking memory over time.
Thanks again, O. O.
Don't use importDump.php for a whole wiki dump, use MWDumper http://www.mediawiki.org/wiki/MWDumper
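A sketch of the usual MWDumper pipeline (the jar path, database name, and MySQL user below are assumptions; adjust them for your install):

```shell
# Assumed names: mwdumper.jar in the current directory, database "wikidb",
# MySQL user "wikiuser". All are placeholders.
DUMP=ptwiki-latest-pages-articles.xml.bz2
DB=wikidb

# MWDumper converts the XML dump to SQL and mysql loads it directly,
# so PHP's memory limits never come into play:
CMD="java -jar mwdumper.jar --format=sql:1.5 $DUMP | mysql -u wikiuser -p $DB"
echo "$CMD"
```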
Platonides wrote:
O. Olson wrote:
Does anyone have experience importing the Wikipedia XML dumps into MediaWiki? I made an attempt with the English wiki dump as well as the Portuguese wiki dump, giving PHP (CLI) 1024 MB of memory in the php.ini file. Both of these attempts fail with out-of-memory errors.
Don't use importDump.php for a whole wiki dump, use MWDumper http://www.mediawiki.org/wiki/MWDumper
MWDumper doesn't fill the secondary link tables. Please see http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for detailed instructions and considerations.
Also keep in mind that the English Wikipedia is *huge*. You will need a decent database server to be able to process it. I wouldn't even try on a desktop/laptop.
-- daniel
Daniel Kinzler wrote:
Platonides wrote: MWDumper doesn't fill the secondary link tables. Please see http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for detailed instructions and considerations.
Also keep in mind that the English Wikipedia is *huge*. You will need a decent database server to be able to process it. I wouldn't even try on a desktop/laptop.
-- daniel
Thanks Daniel. I have tried MWDumper, and the results seem different from importDump.php, i.e. the formatting is messed up. Rather than track down what I might be doing wrong, I would prefer to do this using the native method.
Secondly, my question here is regarding PHP – not about the Database. I don’t see how a memory leak in PHP can be caused by the Database.
Has anyone had practical experience with importDump.php? Did you face any memory issues?
Thanks O. O.
Thanks Daniel. I have tried MWDumper, and the results seem different from importDump.php, i.e. the formatting is messed up. Rather than track down what I might be doing wrong, I would prefer to do this using the native method.
That sounds very *very* odd, because page content is imported as-is in both cases; it's not processed in any way. The only thing I can imagine is that things don't look right if you don't have all the templates imported yet.
Secondly, my question here is regarding PHP – not about the Database. I don’t see how a memory leak in PHP can be caused by the Database.
Just a warning on the side. Have a few hundred gigs handy.
But yes, the memory leak should be investigated. PHP is prone to this kind of thing, because it's not designed for long-running processes and not well tested in that mode. Same goes for MediaWiki, I'm afraid.
-- daniel
Daniel Kinzler wrote:
That sounds very *very* odd, because page content is imported as-is in both cases; it's not processed in any way. The only thing I can imagine is that things don't look right if you don't have all the templates imported yet.
Thanks Daniel. Yes, I think that this may be because the templates are not imported. (I get a lot of "Template: ..." text in the pages.) Any suggestions on how to import the templates?
I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains the templates – but I did not find a way to install them separately.
Another thing I noticed (with the Portuguese wiki, which is a much smaller dump than the English wiki) is that the number of pages imported by importDump.php and MWDumper differs, i.e. importDump.php imported many more pages than MWDumper. That is why I would have preferred to do this using importDump.php.
Also in a previous post, you mentioned about taking care about the “secondary link tables”. How do I do that? Does “secondary links” refer to language links, external links, template links, image links, category links, page links or something else?
Thanks for your patience
O.O.
O. O. wrote:
Daniel Kinzler wrote:
That sounds very *very* odd, because page content is imported as-is in both cases; it's not processed in any way. The only thing I can imagine is that things don't look right if you don't have all the templates imported yet.
Thanks Daniel. Yes, I think that this may be because the templates are not imported. (I get a lot of "Template: ..." text in the pages.) Any suggestions on how to import the templates?
I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains the templates – but I did not find a way to install them separately.
They should be contained. As it says on the download page: "Articles, templates, image descriptions, and primary meta-pages."
Another thing I noticed (with the Portuguese wiki, which is a much smaller dump than the English wiki) is that the number of pages imported by importDump.php and MWDumper differs, i.e. importDump.php imported many more pages than MWDumper. That is why I would have preferred to do this using importDump.php.
The number of pages should be the same. Sounds to me like the import with MWDumper was simply incomplete. Any error messages?
Also in a previous post, you mentioned about taking care about the “secondary link tables”. How do I do that? Does “secondary links” refer to language links, external links, template links, image links, category links, page links or something else?
This is exactly it. You can rebuild them using the rebuildAll.php maintenance script (or was it refreshAll? something like that). But that takes *very* long to run, and might result in the same memory problem you experienced before.
The alternative is to download dumps of these tables and import them into MySQL directly. They are available from the download site.
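For illustration, loading those per-table dumps might look like the sketch below (the wiki prefix, filenames, user, and database name are assumptions following the download site's naming pattern, not exact paths from this thread):

```shell
# Placeholders throughout: adjust the wiki prefix, date, user, and database.
DB=wikidb
TABLES="pagelinks templatelinks categorylinks imagelinks externallinks langlinks"

for T in $TABLES; do
  F="ptwiki-latest-$T.sql.gz"
  # Each table dump is plain SQL, gzipped; feed it straight to mysql:
  CMD="gunzip -c $F | mysql -u wikiuser -p $DB"
  echo "$CMD"
  # eval "$CMD"   # uncomment to actually run each import
done
```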
-- daniel
Daniel Kinzler wrote:
O. O. wrote:
I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains the templates – but I did not find a way to install them separately.
They should be contained. As it says on the download page: "Articles, templates, image descriptions, and primary meta-pages."
Thanks Daniel. I know that the templates are contained in pages-articles.xml.bz2. However, since it seems MWDumper may not be importing the templates, my question is how to import them.
Another thing I noticed (with the Portuguese wiki, which is a much smaller dump than the English wiki) is that the number of pages imported by importDump.php and MWDumper differs, i.e. importDump.php imported many more pages than MWDumper. That is why I would have preferred to do this using importDump.php.
The number of pages should be the same. Sounds to me like the import with MWDumper was simply incomplete. Any error messages?
Actually, I was intending to start a separate thread on this topic – because both MWDumper and importDump.php report that they are skipping certain pages. I did not note down the error that I received from MWDumper – but the errors from importDump.php look like the one below.
Skipping interwiki page title '<Page_Title>'
Anyway, both have the word "Skipping …" as part of their error messages. I do not have the actual figures – but I noticed that importDump.php seemed to import more pages than MWDumper. (I unfortunately did not save the output – so I cannot compare how many times I got these errors.)
Also in a previous post, you mentioned about taking care about the “secondary link tables”. How do I do that? Does “secondary links” refer to language links, external links, template links, image links, category links, page links or something else?
This is exactly it. You can rebuild them using the rebuildAll.php maintenance script (or was it refreshAll? something like that). But that takes *very* long to run, and might result in the same memory problem you experienced before.
Yes, the script is called rebuildall.php and is mentioned in http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper – as you mentioned, I was expecting memory problems with this too, since importDump.php is already having memory issues.
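A sketch of that invocation, with the same per-run memory override discussed earlier (the 1200M value is just an example):

```shell
# rebuildall.php is another long-running PHP process, so the same
# php -d memory_limit override applies; 1200M is a placeholder value.
CMD="php -d memory_limit=1200M maintenance/rebuildall.php"
echo "$CMD"
# Run from the MediaWiki root with:  eval "$CMD"
```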
The alternative is to download dumps of these tables and import them into MySQL directly. They are available from the download site.
-- daniel
I will try to import the tables tomorrow to see what I get.
Thanks again,
O. O.
Platonides wrote:
Don't use importDump.php for a whole wiki dump, use MWDumper http://www.mediawiki.org/wiki/MWDumper
Thanks Platonides. I am just curious why http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_importDump.ph... says that importDump.php is the recommended method for imports.
You need to understand that this page does warn that the import of large dumps will be "slow". My concern here is not the "slowness" but the fact that the import crashes with an out-of-memory error. I can give PHP more memory – but the usage just seems to grow over time.
Is this the correct place to ask such questions? Or are there better places?
O. O.
Is this on MW older than 1.14? You may want to disable profiling if it is on.
-Aaron
--------------------------------------------------
From: "O. O. " olson_ot@yahoo.com
Sent: Saturday, March 07, 2009 10:28 PM
To: wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Platonides wrote:
Don't use importDump.php for a whole wiki dump, use MWDumper http://www.mediawiki.org/wiki/MWDumper
Thanks Platonides. I am just curious why http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_importDump.ph... says that importDump.php is the recommended method for imports.
You need to understand that this page does warn that the import of large dumps will be "slow". My concern here is not the "slowness" but the fact that the import crashes with an out-of-memory error. I can give PHP more memory – but the usage just seems to grow over time.
Is this the correct place to ask such questions? Or are there better places?
O. O.
Jason Schulz wrote:
Is this on MW older than 1.14? You may want to disable profiling if it is on.
-Aaron
Thanks Jason/Aaron. No, this is the recent MW 1.14 – downloaded at the beginning of this week from http://www.mediawiki.org/wiki/Download.