Re: [Wikitech-l] Parsing database dumps


Mike O

28 Sep 2006

11:01 p.m.

...

Mike wrote:

...look at this snippet of XML from enwiki-latest-pages-articles.xml.bz2 taken from late August. I don't see any namespace in the <title> elements, ... <title>AaA</title> ... <title>AlgeriA</title>

Those are main namespace (namespace 0) articles, so they don't have a prefix. But all the non-main-namespace pages do. (The first is <id>724</id><title>Wikipedia:Adding Wikipedia articles to Nupedia</title>.)

Thanks! That explains a lot.

Mike O


Mark Clements

29 Sep

1:23 a.m.

New subject: Parsing database dumps

"Mike O" <mikeo(a)operamail.com> wrote in message news:20060928210104.E1668CA145@ws5-11.us4.outblaze.com...

> > Mike wrote:
> > ...look at this snippet of XML from enwiki-latest-pages-articles.xml.bz2 taken from late August. I don't see any namespace in the <title> elements, ... <title>AaA</title> ... <title>AlgeriA</title>
>
> Thanks! That explains a lot.

As a separate but related question, why is the namespace not given as part of the page information? e.g.

<title>Help:Contents</title>
<namespace>12</namespace>
<pagetitle>Contents</pagetitle>

Surely this would be more useful when it comes to wider application?

- Mark Clements (HappyDog)

Platonides

11:25 p.m.

New subject: Parsing database dumps

"Mark Clements" wrote:

...

I'd add it there as <title ns="12">Help:Contents</title> (an undefined parameter meaning 'old XML version', not main namespace). Giving title, namespace and pagetitle is redundant and should be avoided. The redundancy can amount to several MB for uncompressed dumps.

Mark Clements

1 Oct

3:18 a.m.

New subject: Parsing database dumps

"Platonides" <Platonides(a)gmail.com> wrote in message news:efk2vt$1gk$1@sea.gmane.org...

> "Mark Clements" wrote:
> ...
> I'd add it there as <title ns="12">Help:Contents</title> (undefined parameter meaning 'old xml version', not main namespace) Giving title, namespace and pagetitle is redundant and should be avoided. It can be several Mb for uncompressed dumps.

That's a pretty good solution, although one of the issues is that the title includes the namespace, which needs to be removed to get the actual page title. I feel that the <page> section should be complete in and of itself, without requiring the header section mapping namespace names to ids.

Without knowing the mappings (ns to ns-title) that are present in the header, you cannot interpret the title unambiguously. For example, <title ns="0">Star Trek: The Next Generation</title> relies on the parser knowing that ns-0 is not called 'Star Trek' in order to be interpreted properly.

How about <title ns="12" ns-title="Help">Contents</title>?

- Mark Clements (HappyDog)
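To make the ambiguity concrete, here is a minimal Python sketch of a parser that splits on the first colon without consulting any namespace table. The function name `naive_split` is hypothetical and is not part of MediaWiki or of any dump tooling; it only illustrates the failure mode described above.

```python
from typing import Tuple


def naive_split(full_title: str) -> Tuple[str, str]:
    """Split on the first colon, with no knowledge of the wiki's namespaces.

    Hypothetical helper for illustration only.
    """
    prefix, sep, rest = full_title.partition(":")
    if not sep:
        # No colon at all: treat the whole string as the page title.
        return "", full_title
    return prefix, rest.lstrip()


# A real namespace prefix is split correctly...
print(naive_split("Help:Contents"))
# ...but a main-namespace title containing a colon is mis-split,
# with "Star Trek" wrongly reported as the namespace.
print(naive_split("Star Trek: The Next Generation"))
```

Without a namespace number or a list of valid prefixes, the function has no way to tell the two cases apart.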

Platonides

1:40 p.m.

New subject: Parsing database dumps

...

I think you could assume that any non-zero namespace has a prefix, so you'd only need to split on the first ':' if it has a namespace number != 0 (this assumes we will never set up a namespace with ':' in it). BTW: why are you having so much trouble with this?
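The rule suggested here could be sketched as follows in Python. It assumes the proposed `ns` attribute is available for each title (it is a proposal in this thread, not something the dumps provide), and that no namespace name contains a colon. The name `split_title` is hypothetical.

```python
from typing import Tuple


def split_title(full_title: str, ns: int) -> Tuple[str, str]:
    """Return (namespace prefix, page title) given the namespace number.

    Assumes: namespace 0 never has a prefix, every other namespace
    always has one, and no namespace name contains ':'.
    """
    if ns == 0:
        # Main namespace: any colon belongs to the page title itself.
        return "", full_title
    prefix, sep, page = full_title.partition(":")
    if not sep:
        raise ValueError("non-main namespace title without a prefix: %r" % full_title)
    return prefix, page


print(split_title("Help:Contents", 12))
print(split_title("Star Trek: The Next Generation", 0))
```

Because only the first colon is used as the separator, later colons in the page title (for example `Help:A:B`) stay in the page title.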

Mark Clements

7:21 p.m.

New subject: Parsing database dumps

"Platonides" <Platonides(a)gmail.com> wrote in message news:efo9g3$tee$1@sea.gmane.org...

> > That's a pretty good solution, although one of the issues is that the title includes the namespace, which needs to be removed to get the actual page title. I feel that the <page> section should be complete in and of itself, without requiring the header section mapping namespace names to ids. Without knowing the mappings (ns to ns-title) that are present in the header, you cannot interpret the title unambiguosly, for example <title ns="0">Star Trek: The Next Generation</title> relies on the parser knowing that ns-0 is not called 'Star Trek' in order to be interpreted properly. How about <title ns="12" ns-title="Help">Contents</title>?
> > - Mark Clements (HappyDog)
>
> I think you could assume that any non-zero namespace has prefix so you'd only need to split on the first ':' if it has a namespace number != 0 (this assumes we will never setup a namespace with ':' in it). BTW: why are you having so much trouble with this?

Personally - I have no trouble, as I don't use the dumps :) I'm looking at it from a technical point of view - you need to be able to unambiguously know the name of the page, ideally without requiring the header information (thus allowing individual pages to be spat out and manipulated in XML without requiring extra meta-data). If you cannot do that, then I see it as a shortcoming in the data schema.

If the following statements are all true, then there is no ambiguity. If any of them are false then that ambiguity remains:

1) The main namespace can never have a prefix (ns title is always empty)
2) All other namespaces _must_ have a prefix
3) A namespace title cannot contain a colon

If they are all true then your assumption above is valid, otherwise it is not and you will parse some edge-case titles incorrectly.

- Mark Clements (HappyDog)
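The three conditions can be checked mechanically against a namespace table. The sketch below is only an illustration: the function `check_assumptions` and the dictionary layout (namespace number mapped to namespace name) are assumptions made for this example, not part of any existing tool.

```python
from typing import Dict, List


def check_assumptions(namespaces: Dict[int, str]) -> List[str]:
    """Return a list of violations of the three conditions (empty if none)."""
    problems = []
    for key, name in namespaces.items():
        if key == 0 and name != "":
            # Condition 1: the main namespace must have no prefix.
            problems.append("main namespace has a prefix: %r" % name)
        if key != 0 and name == "":
            # Condition 2: every other namespace must have a prefix.
            problems.append("namespace %d has no prefix" % key)
        if ":" in name:
            # Condition 3: a namespace name must not contain a colon.
            problems.append("namespace %d name contains a colon: %r" % (key, name))
    return problems


# Illustrative table: the main namespace plus Help (12).
print(check_assumptions({0: "", 12: "Help"}))
```

If the returned list is empty, splitting on the first colon for non-zero namespaces is safe for that table.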

Brion Vibber

8:29 p.m.

New subject: Parsing database dumps

Mark Clements wrote:

> If the following statements are all true, then there is no ambiguity. If any of them are false then that ambiguity remains:
> 1) The main namespace can never have a prefix (ns title is always empty)

Yes.

> 2) All other namespaces _must_ have a prefix

Yes.

> 3) A namespace title cannot contain a colon

No.

> If they are all true then your assumption above is valid, otherwise it is not and you will parse some edge-case titles incorrectly.

No. Generally, parsing full-text titles to (namespace, title) requires knowing two things:

a) The set of all defined namespace prefixes
b) The set of all defined interwiki prefixes

For parsing page titles from a dump, you only care about the namespaces -- which are provided in the dump -- since no pages there can have an interwiki title.

And of course, you only care about the namespaces *if you do* care about the namespaces, which you very often may not.

-- brion vibber (brion @ pobox.com)
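The approach described in this message, using the namespace list shipped in the dump header, might look roughly like this in Python. The sample header is a simplified, illustrative fragment: real dumps wrap it in a <mediawiki> root element with an XML namespace declaration, which this sketch does not handle. The function names `load_namespaces` and `parse_title` are hypothetical, and the exact-match lookup ignores MediaWiki's case and underscore normalisation.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

# Simplified, illustrative header fragment (not copied from a real dump).
SAMPLE_HEADER = """
<siteinfo>
  <namespaces>
    <namespace key="0" />
    <namespace key="4">Wikipedia</namespace>
    <namespace key="12">Help</namespace>
  </namespaces>
</siteinfo>
"""


def load_namespaces(siteinfo_xml: str) -> Dict[str, int]:
    """Map each non-empty namespace prefix to its numeric key."""
    root = ET.fromstring(siteinfo_xml.strip())
    prefixes = {}
    for elem in root.iter("namespace"):
        name = elem.text or ""
        if name:  # the main namespace has no prefix, so skip it
            prefixes[name] = int(elem.get("key"))
    return prefixes


def parse_title(full_title: str, prefixes: Dict[str, int]) -> Tuple[int, str]:
    """Return (namespace number, page title) using the known prefixes."""
    prefix, sep, rest = full_title.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix], rest
    # Unknown or absent prefix: the whole string is a main-namespace title.
    return 0, full_title


prefixes = load_namespaces(SAMPLE_HEADER)
print(parse_title("Help:Contents", prefixes))
print(parse_title("Star Trek: The Next Generation", prefixes))
```

Because "Star Trek" is not in the prefix table, the second title is correctly kept whole in namespace 0, which is the case the first-colon split alone cannot resolve.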

Jay R. Ashworth

8:48 p.m.

New subject: Parsing database dumps

On Sun, Oct 01, 2006 at 11:29:33AM -0700, Brion Vibber wrote:

> Mark Clements wrote:
> > If the following statements are all true, then there is no ambiguity. If any of them are false then that ambiguity remains:
> > 1) The main namespace can never have a prefix (ns title is always empty)
>
> Yes.
>
> > 2) All other namespaces _must_ have a prefix
>
> Yes.
>
> > 3) A namespace title cannot contain a colon
>
> No.

Is there *any reason at all* that this stricture cannot be imposed post-hoc (IE: now)?

It would seem to make lots of things lots easier, with almost no real-world impact. We have lots of wikis in lots of languages. Are there *ANY* namespaces in any language with an ASCII colon as a valid character in their name?

Can we think of any other reason not to impose such a stricture?

Cheers,
-- jra
--
Jay R. Ashworth jra(a)baylink.com

Brion Vibber

11:59 p.m.

New subject: Parsing database dumps

Jay R. Ashworth wrote:

> On Sun, Oct 01, 2006 at 11:29:33AM -0700, Brion Vibber wrote:
> > Mark Clements wrote:
> > > If the following statements are all true, then there is no ambiguity. If any of them are false then that ambiguity remains:
> > > 1) The main namespace can never have a prefix (ns title is always empty)
> >
> > Yes.
> >
> > > 2) All other namespaces _must_ have a prefix
> >
> > Yes.
> >
> > > 3) A namespace title cannot contain a colon
> >
> > No.
>
> Is there *any reason at all* that this stricture cannot be imposed post-hoc (IE: now)?

Er, actually that should be "yes". At least, it would probably break a lot if you tried to do that. I misread the question.

-- brion vibber (brion @ pobox.com)
