Mike O wrote:
> I'm having a bit of trouble figuring out the database dump XML. Looking at the articles dump I see page content is wrapped in <page> and </page> elements.
First, look at the documentation:
http://meta.wikimedia.org/wiki/Data_dumps
http://meta.wikimedia.org/wiki/Help:Export
http://www.mediawiki.org/wiki/MWDumper
> What I don't see is how to determine what namespace an article correlates to. I see the namespace elements at the top of the file, but how do I match articles to the right namespace?
The page title is the text content of the /mediawiki/page/title element. The namespace prefix appears at the beginning of that title, separated from the rest of the title by a colon. Pages in the main namespace have no prefix.
The correspondence between namespace prefixes and internal numeric IDs is given in the /mediawiki/siteinfo/namespaces/namespace elements, so you can automate your processing to work properly with wikis in different languages or with different custom namespaces.
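A rough sketch of that automation, assuming Python and its standard xml.etree.ElementTree module (the dump format does not prescribe either). The names local, pages_with_namespace and SAMPLE are made up for illustration, and the sample XML is a hand-written fragment, not real dump output. The sketch treats ID 0 as the main namespace, and it only accepts a prefix that is listed in siteinfo, since a main-namespace title can itself contain a colon.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written illustrative fragment; real dumps carry many more elements.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <siteinfo>
    <namespaces>
      <namespace key="-1">Special</namespace>
      <namespace key="0" />
      <namespace key="1">Talk</namespace>
      <namespace key="2">User</namespace>
    </namespaces>
  </siteinfo>
  <page><title>Star Trek: Voyager</title></page>
  <page><title>Talk:Star Trek: Voyager</title></page>
  <page><title>User:Example</title></page>
</mediawiki>"""


def local(tag):
    """Strip any '{uri}' XML-namespace qualifier from an element tag."""
    return tag.rsplit("}", 1)[-1]


def pages_with_namespace(source):
    """Yield (namespace_id, title) for each <page> in a dump stream."""
    prefixes = {}  # namespace prefix -> numeric ID, filled from siteinfo
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "namespace":
            # The main namespace has an empty name, so its text is None.
            prefixes[elem.text or ""] = int(elem.get("key"))
        elif name == "page":
            title = ""
            for child in elem:
                if local(child.tag) == "title":
                    title = child.text or ""
            prefix, colon, _rest = title.partition(":")
            # A colon marks a namespace only if the prefix is a known one;
            # anything else is assumed to be in the main namespace (ID 0).
            ns_id = prefixes.get(prefix, 0) if colon else 0
            yield ns_id, title
            elem.clear()  # release the page's subtree to keep memory flat


for ns_id, title in pages_with_namespace(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(ns_id, title)
```

Because it reads the stream incrementally and clears each page after use, the same function should work on a full dump opened as a binary file, though that is untested here against real dump files.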
-- brion vibber (brion @ pobox.com)