Felipe Ortega wrote:
--- On Sun, 8/3/09, O. O. <olson_ot(a)yahoo.com> wrote:
I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains
the templates, but I did not find a way to install them
separately.
No, it only contains a dump of the current version of each article (involving the page,
revision and text tables in the DB).
Thanks Felipe for posting.
The description of pages-articles.xml.bz2 at
http://download.wikimedia.org/enwiki/20081008/ says that it contains
“Articles, templates, image descriptions, and primary meta-pages.” What
does “templates” mean if the dump does not contain the templates?
Another thing I noticed (with the Portuguese Wiki, which is a much
smaller dump than the English Wiki) is that the number of pages
imported by importDump.php and MWDumper differs, i.e. importDump.php
imported many more pages than MWDumper. That is why I would have
preferred to do this using importDump.php.
On download.wikimedia.org/your_lang_here you can check how many pages were supposed to be
included in each dump.
You also have other parsers you may want to check (in my experience, my parser was
slightly faster than mwdumper):
http://meta.wikimedia.org/wiki/WikiXRay_Python_parser
Here my concern is not about speed, but about integrity. I don’t mind
the import taking a long time, as long as it completes. I used
importDump.php because it was listed as the “Recommended way” of
importing. But now I realize that no one has used it on a real Wikipedia
dump.
Nonetheless, I will give your tool a try sometime over the next two
weeks or so.
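
For the integrity check discussed above, one option is to count the <page> elements in the dump itself and compare that number with what each importer actually loaded. The following is only a minimal sketch in Python: the function names count_pages and count_pages_in_dump are hypothetical, and it assumes the dump is a standard bz2-compressed MediaWiki XML export in which every page is wrapped in a <page> element.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def count_pages(stream):
    """Count <page> elements in a MediaWiki XML export read from a binary stream."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Tags are namespace-qualified ("{namespace-uri}page"), so compare
        # only the local name after the closing brace.
        if elem.tag.rsplit("}", 1)[-1] == "page":
            count += 1
            # Drop the page's text and children so memory use stays modest.
            elem.clear()
    return count


def count_pages_in_dump(path_or_file):
    """Count pages in a bz2-compressed dump, decompressing on the fly."""
    with bz2.open(path_or_file, "rb") as stream:
        return count_pages(stream)


# Hypothetical usage (the file name is a placeholder, not a real dump name):
# print(count_pages_in_dump("pages-articles.xml.bz2"))
```

The result can then be compared with the row count of the page table after each import. The two numbers may still differ for legitimate reasons, so a mismatch is a hint to investigate rather than proof of a broken import.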
Also, in a previous post you mentioned taking care of the
“secondary link tables”. How do I do that? Does “secondary links”
refer to language links, external links, template links, image links,
category links, page links or something else?
On the same page for downloads you have a list of additional dumps in SQL format (then
compressed with gzip). I guess you may also want to import them (but of course, you
don't need a parser for them, they can be directly loaded in the DB).
Best,
F.
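
Since the additional dumps are gzip-compressed SQL, they can be streamed straight into a database client without unpacking them to disk first. The following is a rough sketch in Python, not a tested procedure: the function name load_sql_gz is hypothetical, and the client command, user, database and file name in the usage comment are placeholders that depend on the local setup.

```python
import gzip
import shutil
import subprocess


def load_sql_gz(path, client_cmd):
    """Decompress a .sql.gz file and pipe its contents to a client's stdin.

    client_cmd is the full command line of the database client as a list of
    strings; it is supplied by the caller because it depends on the local setup.
    """
    proc = subprocess.Popen(client_cmd, stdin=subprocess.PIPE)
    try:
        with gzip.open(path, "rb") as src:
            # Copy in chunks so the whole dump never has to fit in memory.
            shutil.copyfileobj(src, proc.stdin)
    finally:
        proc.stdin.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError(f"client exited with status {returncode}")


# Hypothetical usage; user, database and file name are placeholders:
# load_sql_gz("categorylinks.sql.gz", ["mysql", "--user=wikiuser", "wikidb"])
```

Passing the whole client command as a list keeps credentials and connection options out of the function, so the same helper can be reused for each of the link tables.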
I have not tried these yet. I will try them tomorrow and get back to
you, i.e. the newsgroup.
Thanks again,
O. O.