Mike wrote:
...look at this snippet of XML from enwiki-latest-pages-articles.xml.bz2 taken from late August. I don't see any namespace in the <title> elements, ... <title>AaA</title> ... <title>AlgeriA</title>
Those are main namespace (namespace 0) articles, so they don't have a prefix. But all the non-main-namespace pages do. (The first is <id>724</id><title>Wikipedia:Adding Wikipedia articles to Nupedia</title>.)
Thanks! That explains a lot.
Mike O
As a separate, but related question, why is the namespace not given as part of the page information?
e.g. <title>Help:Contents</title> <namespace>12</namespace> <pagetitle>Contents</pagetitle>
Surely this would be more useful when it comes to wider application?
- Mark Clements (HappyDog)
Platonides wrote:
I'd add it there as <title ns="12">Help:Contents</title> (with a missing ns attribute meaning 'old XML version', not main namespace). Giving title, namespace and pagetitle is redundant and should be avoided. It can add up to several MB for uncompressed dumps.
That's a pretty good solution, although one of the issues is that the title still includes the namespace prefix, which needs to be removed to get the actual page title. I feel that the <page> section should be complete in and of itself, without requiring the header section that maps namespace names to ids. Without knowing the mappings (ns to ns-title) that are present in the header, you cannot interpret the title unambiguously. For example, <title ns="0">Star Trek: The Next Generation</title> relies on the parser knowing that ns-0 is not called 'Star Trek' in order to be interpreted properly.
How about <title ns="12" ns-title="Help">Contents</title>?
- Mark Clements (HappyDog)
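To make the ambiguity concrete, here is a minimal Python sketch. The ns and ns-title attributes are only the proposal made in this thread, not an existing dump schema, and naive_split and read_proposed_title are hypothetical helper names chosen for illustration.

```python
import xml.etree.ElementTree as ET

def naive_split(full_title):
    """Split on the first colon with no knowledge of the namespace list."""
    prefix, sep, rest = full_title.partition(":")
    if not sep:
        return "", full_title
    return prefix, rest.lstrip()

def read_proposed_title(xml_fragment):
    """Read the hypothetical <title ns=".." ns-title="..">..</title> form."""
    elem = ET.fromstring(xml_fragment)
    return int(elem.get("ns")), elem.get("ns-title", ""), elem.text

# Without the namespace list, a main-namespace article looks prefixed.
print(naive_split("Star Trek: The Next Generation"))

# The proposed element carries everything needed on its own.
print(read_proposed_title('<title ns="12" ns-title="Help">Contents</title>'))
```

The first call reports 'Star Trek' as if it were a namespace, which is exactly the misreading described above. The second needs no header lookup at all, at the cost of repeating the namespace name on every page.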
Platonides wrote:
I think you could assume that any non-zero namespace has a prefix, so you'd only need to split on the first ':' if it has a namespace number != 0 (this assumes we will never set up a namespace with ':' in it).
BTW: why are you having so much trouble with this?
Personally - I have no trouble, as I don't use the dumps :)
I'm looking at it from a technical point of view - you need to be able to unambiguously know the name of the page, ideally without requiring the header information (thus allowing individual pages to be spat out and manipulated in XML without requiring extra meta-data). If you cannot do that, then I see it as a shortcoming in the data schema.
If the following statements are all true, then there is no ambiguity. If any of them are false then that ambiguity remains:
1) The main namespace can never have a prefix (ns title is always empty)
2) All other namespaces _must_ have a prefix
3) A namespace title cannot contain a colon
If they are all true then your assumption above is valid, otherwise it is not and you will parse some edge-case titles incorrectly.
- Mark Clements (HappyDog)
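The split-on-first-colon rule and its three preconditions can be sketched roughly as follows. This assumes the namespace number is available for each page (for example through the ns attribute proposed earlier in the thread, which is not part of the dump being discussed). The function name and the "Foo:Bar" namespace with id 100 are made up for illustration.

```python
def split_by_ns_number(full_title, ns):
    """Split a full title into (prefix, page title) using only the ns number.

    Correct only if: ns 0 never has a prefix, every other namespace always
    has one, and no namespace name contains a colon.
    """
    if ns == 0:
        # Statement 1: main-namespace titles are never prefixed.
        return "", full_title
    prefix, sep, rest = full_title.partition(":")
    if not sep:
        # Statement 2 would be violated by such a title.
        raise ValueError("non-main namespace title has no prefix: %r" % full_title)
    return prefix, rest

print(split_by_ns_number("Help:Contents", 12))
print(split_by_ns_number("Star Trek: The Next Generation", 0))

# If statement 3 were false, e.g. a made-up namespace "Foo:Bar" with id 100,
# the page "Baz" would be split in the wrong place:
print(split_by_ns_number("Foo:Bar:Baz", 100))
```

The last call returns the prefix 'Foo' and the page title 'Bar:Baz', which is the edge-case misparse that statement 3 is meant to rule out.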
Mark Clements wrote:
If the following statements are all true, then there is no ambiguity. If any of them are false then that ambiguity remains:
- The main namespace can never have a prefix (ns title is always empty)
Yes.
- All other namespaces _must_ have a prefix
Yes.
- A namespace title cannot contain a colon
No.
If they are all true then your assumption above is valid, otherwise it is not and you will parse some edge-case titles incorrectly.
No.
Generally, parsing full-text titles to (namespace, title) requires knowing two things:
a) The set of all defined namespace prefixes
b) The set of all defined interwiki prefixes
For parsing page titles from a dump, you only care about the namespaces -- which are provided in the dump -- since no pages there can have an interwiki title.
And of course, you only care about the namespaces *if you do* care about the namespaces, which you very often may not.
-- brion vibber (brion @ pobox.com)
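As a rough illustration of parsing against the set of defined namespace prefixes, the sketch below reads a namespace list and matches titles against it. It assumes a header shaped like the simplified SAMPLE_HEADER string, which is a stand-in and not a real dump excerpt (real dumps also declare an XML namespace on the root element and list many more entries). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a dump header (an assumption, not a real excerpt).
SAMPLE_HEADER = """
<siteinfo>
  <namespaces>
    <namespace key="0" />
    <namespace key="4">Wikipedia</namespace>
    <namespace key="12">Help</namespace>
  </namespaces>
</siteinfo>
"""

def load_namespaces(siteinfo_xml):
    """Map each namespace name to its numeric key; main maps from ''."""
    root = ET.fromstring(siteinfo_xml)
    return {(ns.text or ""): int(ns.get("key")) for ns in root.iter("namespace")}

def split_title(full_title, namespaces):
    """Return (key, page title), trying the longest known prefix first."""
    for name in sorted(namespaces, key=len, reverse=True):
        if name and full_title.startswith(name + ":"):
            return namespaces[name], full_title[len(name) + 1:]
    # No defined prefix matched, so the whole string is a main-namespace title.
    return namespaces.get("", 0), full_title

namespaces = load_namespaces(SAMPLE_HEADER)
print(split_title("Help:Contents", namespaces))
print(split_title("Star Trek: The Next Generation", namespaces))
```

Because the match is against known prefixes rather than the first colon, 'Star Trek' is not mistaken for a namespace, and a prefix that itself contained a colon would still match. The sketch compares prefixes exactly and ignores any case or whitespace normalization the wiki software may apply.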
On Sun, Oct 01, 2006 at 11:29:33AM -0700, Brion Vibber wrote:
- A namespace title cannot contain a colon
No.
Is there *any reason at all* that this stricture cannot be imposed post-hoc (IE: now)?
It would seem to make lots of things a lot easier, with almost no real-world impact.
We have lots of wikis in lots of languages. Are there *ANY* namespaces in any language with an ASCII colon as a valid character in their name?
Can we think of any other reason not to impose such a stricture?
Cheers, -- jra
Jay R. Ashworth wrote:
- A namespace title cannot contain a colon
No.
Is there *any reason at all* that this stricture cannot be imposed post-hoc (IE: now)?
Er, actually that should be "yes". At least, it would probably break a lot if you tried to do that. I misread the question.
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org