I'm having a bit of trouble figuring out database dump XML. Looking at the articles dump I see page content is wrapped in <page> and </page> elements. What I don't see is how to determine what namespace an article correlates to. I see the namespace elements at the top of the file, but how do I match articles to the right namespace?
Mike O
Mike O wrote:
I'm having a bit of trouble figuring out database dump XML. Looking at the articles dump I see page content is wrapped in <page> and </page> elements.
First, look at the documentation:
http://meta.wikimedia.org/wiki/Data_dumps
http://meta.wikimedia.org/wiki/Help:Export
http://www.mediawiki.org/wiki/MWDumper
What I don't see is how to determine what namespace an article correlates to. I see the namespace elements at the top of the file, but how do I match articles to the right namespace?
The namespace prefix appears at the beginning of the page title, which is the text content of the /mediawiki/page/title element. A colon separates the prefix from the rest of the title.
The correspondence between namespace prefixes and internal numeric IDs is given in the /mediawiki/siteinfo/namespaces/namespace elements, so you can automate your processing to work properly with wikis in different languages or with different custom namespaces.
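To illustrate, here is a minimal Python sketch of reading that table. It assumes each namespace element carries its numeric ID in a "key" attribute and its prefix as text, with the main namespace having empty text. The sample XML, the xmlns URI in it, and the names local_name and read_namespaces are made up for the example; check them against your own dump.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{uri}' qualifier that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def read_namespaces(source):
    """Return {prefix: numeric_id} from the siteinfo block of a dump."""
    prefixes = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "namespace":
            # The main namespace has no text, so store it under "".
            prefixes[elem.text or ""] = int(elem.get("key"))
        elif name == "siteinfo":
            break  # the table is complete; no need to read the pages
    return prefixes


# Hypothetical, heavily trimmed dump used only for this example.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo>
    <namespaces>
      <namespace key="-1">Special</namespace>
      <namespace key="0" />
      <namespace key="1">Talk</namespace>
      <namespace key="10">Template</namespace>
    </namespaces>
  </siteinfo>
  <page><title>Template:Stub</title></page>
</mediawiki>"""

namespaces = read_namespaces(io.BytesIO(SAMPLE))
print(namespaces)
```

Because the table is read from the file itself, the same code should cope with localized or custom namespace names without a hard-coded list.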
-- brion vibber (brion @ pobox.com)
Mike wrote:
...What I don't see is how to determine what namespace an article correlates to. I see the namespace elements at the top of the file, but how do I match articles to the right namespace?
As Brion already explained, the namespaces are right there in each entry's title, using the same colon-separated syntax as in the normal wiki UI.
It's worth noting, though, that the colon syntax isn't unambiguous. If you're looking for namespace prefixes, you have to look for only the ones that are known to exist in the project. You can't just look for colons, because there are plenty of articles with colons in their names, such as [[Boston Red Sox: Managers and Ownership]] and [[Star Trek: The Next Generation]], but obviously "Boston Red Sox" and "Star Trek" aren't namespaces.
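A rough Python sketch of that check follows. The KNOWN table is a hypothetical subset of a project's namespaces (in practice you would build it from the namespace elements at the top of the dump), split_title is a made-up helper name, and treating 0 as the ID of the main namespace is an assumption. It also assumes titles in the dump use the exact prefix spelling from that table.

```python
# Hypothetical subset of a project's namespace table: prefix -> numeric ID.
KNOWN = {"Talk": 1, "User": 2, "Template": 10}


def split_title(title, namespaces, main_id=0):
    """Split a page title into (namespace_id, title_without_prefix).

    Only text before the first colon that matches a known prefix counts
    as a namespace; any other colon is just part of the article name.
    """
    prefix, sep, rest = title.partition(":")
    if sep and prefix and prefix in namespaces:
        return namespaces[prefix], rest
    return main_id, title


print(split_title("Template:Stub", KNOWN))
print(split_title("Star Trek: The Next Generation", KNOWN))
```

The first call should report namespace 10 with the title "Stub"; the second should leave the Star Trek title whole and assign it to the main namespace.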
wikitech-l@lists.wikimedia.org