"Platonides" <Platonides(a)gmail.com> wrote in
message news:efo9g3$tee$1@sea.gmane.org...
>> That's a pretty good solution, although one of the issues is that the
>> title includes the namespace, which needs to be removed to get the
>> actual page title. I feel that the <page> section should be complete
>> in and of itself, without requiring the header section mapping
>> namespace names to ids. Without knowing the mappings (ns to ns-title)
>> that are present in the header, you cannot interpret the title
>> unambiguously. For example, <title ns="0">Star Trek: The Next
>> Generation</title> relies on the parser knowing that ns-0 is not
>> called 'Star Trek' in order to be interpreted properly.
>>
>> How about <title ns="12" ns-title="Help">Contents</title>?
>>
>> - Mark Clements (HappyDog)
>
> I think you could assume that any non-zero namespace has a prefix, so
> you'd only need to split on the first ':' if it has a namespace number
> != 0 (this assumes we will never set up a namespace with ':' in it).
>
> BTW: why are you having so much trouble with this?

Personally - I have no trouble, as I don't use the dumps :)
I'm looking at it from a technical point of view - you need to be able to
unambiguously know the name of the page, ideally without requiring the
header information (thus allowing individual pages to be spat out and
manipulated in XML without requiring extra meta-data). If you cannot do
that, then I see it as a shortcoming in the data schema.
If the following statements are all true, then there is no ambiguity. If
any of them are false then that ambiguity remains:
1) The main namespace can never have a prefix (ns title is always empty)
2) All other namespaces _must_ have a prefix
3) A namespace title cannot contain a colon
If they are all true then your assumption above is valid; otherwise it is
not, and you will parse some edge-case titles incorrectly.
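For illustration, here is a minimal Python sketch of the split that those
three statements would permit. The function name split_title and its
return shape are made up for this example and are not taken from any dump
tooling; the logic is only correct if all three statements hold.

```python
from typing import Tuple


def split_title(full_title: str, ns: int) -> Tuple[str, str]:
    """Split a dump title into (namespace prefix, page title).

    Hypothetical helper: it relies on the three statements above and
    does not consult the header's namespace mapping at all.
    """
    if ns == 0:
        # Statement 1: the main namespace never has a prefix, so any
        # colon in the text belongs to the page title itself.
        return "", full_title

    # Statements 2 and 3: every other namespace has a prefix, and that
    # prefix contains no colon, so the first colon is the separator.
    prefix, sep, page = full_title.partition(":")
    if not sep:
        # Statement 2 was violated: a non-zero namespace without a prefix.
        raise ValueError(f"no namespace prefix in {full_title!r} (ns={ns})")
    return prefix, page
```

With this, split_title("Star Trek: The Next Generation", 0) leaves the
title whole, while split_title("Help:Contents", 12) separates "Help" from
"Contents". If a namespace title could ever contain a colon, the second
branch would cut in the wrong place, which is exactly the edge case I
mean.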
- Mark Clements (HappyDog)