"Platonides" Platonides@gmail.com wrote in message news:efo9g3$tee$1@sea.gmane.org...
That's a pretty good solution, although one of the issues is that the title includes the namespace, which needs to be removed to get the actual page title. I feel that the <page> section should be complete in and of itself, without requiring the header section mapping namespace names to ids. Without knowing the mappings (ns to ns-title) that are present in the header, you cannot interpret the title unambiguously. For example, <title ns="0">Star Trek: The Next Generation</title> relies on the parser knowing that ns-0 is not called 'Star Trek' in order to be interpreted properly.
How about <title ns="12" ns-title="Help">Contents</title>?
- Mark Clements (HappyDog)
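To make the suggested element concrete, here is a minimal sketch of how a consumer might read it with Python's standard xml.etree.ElementTree. The ns-title attribute is only the proposal from this thread, not part of any existing dump schema, and read_title is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def read_title(xml_fragment: str) -> Tuple[int, str, str]:
    """Return (ns id, ns title, page name) from a self-describing <title>.

    Assumes the proposed format: the element text holds only the page
    name, and the hypothetical ns-title attribute holds the prefix.
    """
    elem = ET.fromstring(xml_fragment)
    ns_id = int(elem.get("ns", "0"))
    ns_title = elem.get("ns-title", "")  # absent or empty for the main namespace
    page_name = elem.text or ""
    return ns_id, ns_title, page_name


print(read_title('<title ns="12" ns-title="Help">Contents</title>'))
# (12, 'Help', 'Contents')
```

With this shape no header lookup is needed: the page name is never split, so a colon inside it cannot be mistaken for a namespace separator.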
I think you could assume that any non-zero namespace has a prefix, so you'd only need to split on the first ':' if the namespace number is != 0 (this assumes we will never set up a namespace with ':' in its name).
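The rule described here can be sketched in a few lines of Python. This is only an illustration under the stated assumptions (ns 0 never has a prefix, every other namespace always has one, and no namespace name contains ':'); split_title is a hypothetical helper, not something the dump tools provide.

```python
from typing import Tuple


def split_title(full_title: str, ns: int) -> Tuple[str, str]:
    """Split a dump title into (namespace prefix, page name)."""
    if ns == 0:
        # Main namespace: the whole string is the page name, colons included.
        return "", full_title
    # Any other namespace: the prefix ends at the first colon.
    prefix, sep, page_name = full_title.partition(":")
    if not sep:
        raise ValueError(f"title {full_title!r} in ns {ns} has no prefix")
    return prefix, page_name


print(split_title("Star Trek: The Next Generation", 0))
# ('', 'Star Trek: The Next Generation')
print(split_title("Help:Contents", 12))
# ('Help', 'Contents')
```

Splitting only on the first colon keeps later colons in the page name, so "Help:Foo:Bar" in ns 12 yields the page name "Foo:Bar".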
BTW: why are you having so much trouble with this?
Personally - I have no trouble, as I don't use the dumps :)
I'm looking at it from a technical point of view - you need to be able to unambiguously know the name of the page, ideally without requiring the header information (thus allowing individual pages to be spat out and manipulated in XML without requiring extra meta-data). If you cannot do that, then I see it as a shortcoming in the data schema.
If the following statements are all true, then there is no ambiguity. If any of them are false then that ambiguity remains:
1) The main namespace can never have a prefix (ns title is always empty)
2) All other namespaces _must_ have a prefix
3) A namespace title cannot contain a colon
If they are all true then your assumption above is valid, otherwise it is not and you will parse some edge-case titles incorrectly.
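The three statements can also be checked mechanically against a namespace table. The sketch below is illustrative only: the dict of ns id to ns title is an assumed representation of the header mappings, and find_ambiguities is a hypothetical name.

```python
from typing import Dict, List


def find_ambiguities(namespaces: Dict[int, str]) -> List[str]:
    """Report every namespace that breaks one of the three statements."""
    problems = []
    for ns_id, name in sorted(namespaces.items()):
        if ns_id == 0 and name:
            # Statement 1: the main namespace must have an empty title.
            problems.append(f"ns 0 has a prefix: {name!r}")
        if ns_id != 0 and not name:
            # Statement 2: every other namespace must have a prefix.
            problems.append(f"ns {ns_id} has no prefix")
        if ":" in name:
            # Statement 3: a namespace title cannot contain a colon.
            problems.append(f"ns {ns_id} name contains a colon: {name!r}")
    return problems


print(find_ambiguities({0: "", 1: "Talk", 12: "Help"}))
# []
print(find_ambiguities({0: "", 100: "Star Trek: Wiki"}))
# ["ns 100 name contains a colon: 'Star Trek: Wiki'"]
```

An empty result means splitting on the first ':' for non-zero namespaces is safe for that table; any reported entry marks a case where titles could be parsed incorrectly.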
- Mark Clements (HappyDog)