--- On Sun, 8/3/09, O. O. <olson_ot@yahoo.com> wrote:
> I thought that the pages-articles.xml.bz2 (i.e. the XML dump) contains the templates, but I did not find a way to install it separately.
No, it only contains a dump of the current version of each article (involving the page, revision and text tables in the DB).
> Another thing I noticed (with the Portuguese Wiki, which is a much smaller dump than the English Wiki) is that the number of pages imported by importDump.php and MWDumper differs, i.e. importDump.php had many more pages than MWDumper. That is why I would have preferred to do this using importDump.php.
On download.wikimedia.org/your_lang_here you can check how many pages were supposed to be included in each dump.
You also have other parsers you may want to check (in my experience, my parser was slightly faster than mwdumper): http://meta.wikimedia.org/wiki/WikiXRay_Python_parser
> Also, in a previous post you mentioned taking care of the “secondary link tables”. How do I do that? Does “secondary links” refer to language links, external links, template links, image links, category links, page links, or something else?
On the same downloads page you have a list of additional dumps in SQL format (compressed with gzip). I guess you may also want to import them (but of course, you don't need a parser for them; they can be loaded directly into the DB).
Best,
F.
> Thanks for your patience
> O.O.
Felipe Ortega wrote:
> --- On Sun, 8/3/09, O. O. <olson_ot@yahoo.com> wrote:
>> I thought that the pages-articles.xml.bz2 (i.e. the XML dump) contains the templates, but I did not find a way to install it separately.
> No, it only contains a dump of the current version of each article (involving the page, revision and text tables in the DB).
Thanks Felipe for posting.
The page at http://download.wikimedia.org/enwiki/20081008/ describes pages-articles.xml.bz2 as “Articles, templates, image descriptions, and primary meta-pages.” What does “templates” mean here if the dump does not contain the templates?
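(For what it’s worth, one quick way to check would be to stream the dump and count how many page titles carry the template prefix. The following is only a rough Python sketch; the local file name and the English “Template:” prefix are assumptions, and the prefix differs per language.)

import bz2
import xml.etree.ElementTree as ET

DUMP = "pages-articles.xml.bz2"   # placeholder path to the local dump
PREFIX = "Template:"              # template namespace prefix on the English wiki

total = 0
templates = 0

with bz2.BZ2File(DUMP) as f:
    for event, elem in ET.iterparse(f, events=("end",)):
        # tags look like "{http://www.mediawiki.org/xml/export-...}page"
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        total += 1
        for child in elem:
            if child.tag.rsplit("}", 1)[-1] == "title":
                if (child.text or "").startswith(PREFIX):
                    templates += 1
                break
        elem.clear()   # free the finished page so a multi-GB dump stays streamable

print("pages total:    ", total)
print("template pages: ", templates)

If the second number is zero, the templates really are missing from that dump file.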
>> Another thing I noticed (with the Portuguese Wiki, which is a much smaller dump than the English Wiki) is that the number of pages imported by importDump.php and MWDumper differs, i.e. importDump.php had many more pages than MWDumper. That is why I would have preferred to do this using importDump.php.
> On download.wikimedia.org/your_lang_here you can check how many pages were supposed to be included in each dump.
> You also have other parsers you may want to check (in my experience, my parser was slightly faster than mwdumper): http://meta.wikimedia.org/wiki/WikiXRay_Python_parser
Here my concern is not about speed but about integrity. I don’t mind the import taking a long time, as long as it completes. I used importDump.php because it was listed as the “recommended way” of importing, but now I realize that no one has used it on a real Wikipedia dump.
Nonetheless, I will give your tool a try sometime over the next two weeks or so.
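As a rough integrity check, something like the sketch below could compare the number of <page> elements in the dump with what actually ended up in the page table after the import. The file name and DB credentials are placeholders, and it assumes (as the export format does in practice) that each <page> tag starts on its own line.

import bz2
import MySQLdb  # MySQL-python; any DB-API connector would do

DUMP = "ptwiki-pages-articles.xml.bz2"   # placeholder file name

# 1. Pages present in the XML dump.
in_dump = 0
with bz2.BZ2File(DUMP) as f:
    for line in f:
        if line.strip() == b"<page>":
            in_dump += 1

# 2. Pages that the importer actually left in the database.
conn = MySQLdb.connect(host="localhost", user="wikiuser",
                       passwd="secret", db="wikidb")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM page")
in_db = cur.fetchone()[0]
conn.close()

print("pages in dump:", in_dump)
print("pages in DB:  ", in_db)

If the two numbers differ, the import stopped early or silently skipped pages.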
>> Also, in a previous post you mentioned taking care of the “secondary link tables”. How do I do that? Does “secondary links” refer to language links, external links, template links, image links, category links, page links, or something else?
> On the same downloads page you have a list of additional dumps in SQL format (compressed with gzip). I guess you may also want to import them (but of course, you don't need a parser for them; they can be loaded directly into the DB).
> Best,
> F.
I have not tried these as yet. I will try them tomorrow and get back to you, i.e. the list.
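The rough plan, sketched below, would be to decompress each SQL dump and pipe it straight into the mysql client, since these are plain CREATE TABLE / INSERT statements and need no parser. The file names, credentials and list of tables are placeholders.

import gzip
import subprocess

SQL_DUMPS = [
    "ptwiki-pagelinks.sql.gz",
    "ptwiki-categorylinks.sql.gz",
    "ptwiki-templatelinks.sql.gz",
    "ptwiki-imagelinks.sql.gz",
    "ptwiki-externallinks.sql.gz",
    "ptwiki-langlinks.sql.gz",
]

for dump in SQL_DUMPS:
    print("loading", dump)
    with gzip.open(dump, "rb") as f:
        # Stream the decompressed SQL straight into the mysql client.
        mysql = subprocess.Popen(
            ["mysql", "-u", "wikiuser", "-psecret", "wikidb"],
            stdin=subprocess.PIPE)
        while True:
            chunk = f.read(1 << 20)
            if not chunk:
                break
            mysql.stdin.write(chunk)
        mysql.stdin.close()
        mysql.wait()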
Thanks again,
O. O.