Hi,
pages-articles does contain the text of category pages, so you could
recreate the category structure from it. To find the parent categories,
though, you would have to parse the wikimarkup of each category page,
and you would miss any categories that are added through templates.
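
A minimal sketch of that wikimarkup parsing in Python, assuming you
already have the text of one category page as a string. The names
CATEGORY_LINK and parent_categories are made up for illustration, and
the regex is a simplification: it does not skip <nowiki> sections or
comments, and it cannot see categories added through templates.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]].
# The namespace prefix is matched case-insensitively.
# A leading colon ([[:Category:Name]]) is a plain link, not a
# categorization, and is deliberately not matched.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def parent_categories(wikitext):
    """Return the categories named in the wikitext, in order, without duplicates."""
    names = []
    for match in CATEGORY_LINK.finditer(wikitext):
        # Underscores and spaces are interchangeable in titles.
        name = match.group(1).replace("_", " ").strip()
        if name and name not in names:
            names.append(name)
    return names
```

Running this over every page in the Category namespace gives you
(parent, child) pairs, from which the tree can be rebuilt.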
Another option is to use categorylinks.sql, which links page IDs to
the categories those pages belong to. To turn the page IDs into titles,
you would need to read page.sql too. The normal way to use .sql files
is to import them into a (My)SQL database, but you can also parse them
directly (I wrote my own parser for them in .NET [1]).
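
A rough sketch of the direct-parsing route in Python (this is not the
.NET parser from [1], just an illustration). It assumes each INSERT
statement sits on one line, that the first columns of page are
page_id, page_namespace, page_title, and that the first two columns of
categorylinks are cl_from (a page ID) and cl_to (a category title);
check the CREATE TABLE statement at the top of your dump before
relying on that. The function names are made up.

```python
# Backslash escapes that need translating; any other escaped character
# (such as \' or \\) is taken literally.
ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "0": "\0", "Z": "\x1a"}


def parse_insert_rows(line):
    """Yield each row of one 'INSERT INTO ... VALUES (...),(...);' line.

    Values come back as strings; SQL NULL comes back as None.
    """
    start = line.find(" VALUES ")
    if not line.startswith("INSERT INTO") or start == -1:
        return
    i = start + len(" VALUES ")
    n = len(line)
    while i < n:
        if line[i] != "(":
            i += 1  # skip the ',' between rows and the final ';'
            continue
        i += 1
        row = []
        while True:
            if line[i] == "'":  # quoted string value
                i += 1
                chars = []
                while line[i] != "'":
                    if line[i] == "\\":
                        i += 1
                        chars.append(ESCAPES.get(line[i], line[i]))
                    else:
                        chars.append(line[i])
                    i += 1
                i += 1  # step over the closing quote
                row.append("".join(chars))
            else:  # number or NULL
                j = i
                while line[j] not in ",)":
                    j += 1
                token = line[i:j]
                row.append(None if token == "NULL" else token)
                i = j
            if line[i] == ")":
                i += 1
                break
            i += 1  # step over the ',' between values
        yield row


def subcategory_edges(page_lines, categorylinks_lines):
    """Return (parent category title, child category title) pairs."""
    # Namespace 14 is the Category namespace.
    category_titles = {}
    for line in page_lines:
        for row in parse_insert_rows(line):
            if row[1] == "14":
                category_titles[row[0]] = row[2]
    edges = []
    for line in categorylinks_lines:
        for row in parse_insert_rows(line):
            child = category_titles.get(row[0])
            if child is not None:
                edges.append((row[1], child))
    return edges
```

Both arguments can be open file objects, so the dumps are read line by
line. The map of category page IDs is kept in memory, which should be
manageable since only the Category namespace is stored.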
Petr
[1] https://github.com/svick/Wikipedia-SQl-dump-parser
On Wed, Aug 24, 2011 at 14:11, Imre Kovács <imrekovacs(a)msn.com> wrote:
Hello everyone,
I hope I'm not disturbing you too much. I have the following question:
I'm considering downloading enwiki-latest-pages-articles.xml, but I need
to know whether it contains enough information to rebuild the category
structure (parent categories, subcategories, including
Category:Contents, etc.). Does the dump include the category pages or only
the articles?
Thank you very much,
Imre
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l