Hi,
pages-articles does contain the text of category pages, so you could recreate the category structure from it. To find the parent categories, though, you would have to parse the wikimarkup of each category page, and that only works if you are okay with ignoring categories added through templates.
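To illustrate the wikimarkup approach, here is a minimal sketch in Python that pulls the parent categories out of the text of one page. The function name parent_categories and the regular expression are my own assumptions, not part of any dump tooling. It only handles the English "Category:" prefix, and it does not skip links inside comments or nowiki sections.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]].
# A leading colon, as in [[:Category:Name]], is a plain link, not a
# categorization, and is deliberately not matched.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def parent_categories(wikitext):
    """Return the categories a page is placed in directly, in order of appearance.

    Hypothetical helper: categories added through templates are not seen.
    """
    parents = []
    for match in CATEGORY_LINK.finditer(wikitext):
        # Underscores and spaces are equivalent in titles, so normalize to spaces.
        name = match.group(1).strip().replace("_", " ")
        if name and name not in parents:
            parents.append(name)
    return parents
```

You would run this over the text of every page whose title starts with "Category:" and collect the results into a child-to-parents map.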
Another option is to use categorylinks.sql, which contains the links from page IDs to the categories those pages belong to. Since it identifies the member pages only by ID, you would need to read page.sql too to get their titles. The normal way to use .sql files is to import them into a (My)SQL database, but you can also parse them directly (I wrote my own parser for them in .NET [1]).
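As a rough sketch of how the two tables fit together, the Python below builds a parent-to-subcategories map. It assumes the rows have already been extracted from the .sql files (by importing them or by parsing them) into plain tuples, and that the columns are page_id, page_namespace, page_title and cl_from, cl_to, which is my understanding of the schema; check the CREATE TABLE statement at the top of each dump file before relying on it. The function name build_subcategory_map is hypothetical.

```python
from collections import defaultdict

CATEGORY_NAMESPACE = 14  # MediaWiki namespace number for "Category:" pages


def build_subcategory_map(page_rows, categorylinks_rows):
    """Map each category title to the set of its direct subcategory titles.

    page_rows: iterable of (page_id, page_namespace, page_title)
    categorylinks_rows: iterable of (cl_from, cl_to), where cl_from is the
        ID of the member page and cl_to is the category title (assumed columns).
    """
    # Keep only category pages, so that articles are not counted as subcategories.
    category_title_by_id = {
        page_id: title
        for page_id, namespace, title in page_rows
        if namespace == CATEGORY_NAMESPACE
    }

    subcategories = defaultdict(set)
    for cl_from, cl_to in categorylinks_rows:
        child = category_title_by_id.get(cl_from)
        if child is not None:
            subcategories[cl_to].add(child)
    return subcategories
```

Starting from "Contents" and following the map downwards would then give the whole tree. Category links can form cycles, so a traversal should keep track of the categories it has already visited.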
Petr
[1] https://github.com/svick/Wikipedia-SQl-dump-parser
On Wed, Aug 24, 2011 at 14:11, Imre Kovács imrekovacs@msn.com wrote:
Hello everyone,
I hope I'm not disturbing you too much. I have the following question:
I'm considering downloading enwiki-latest-pages-articles.xml, but I need to know whether it contains enough information to rebuild the category structure (parent categories, subcategories, including Category:Contents, etc.). Does the dump include the category pages or only the articles?
Thank you very much, Imre
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l