the problem with languages being separate from namespaces

Jonathan Walther

9 Dec 2002

2:53 a.m.

I have a goal for the design. My goal is that, once a Wikilink is exploded based on the ':' character, I can do simple database lookups, and append the results together to get the target URL without having to do any parsing on the data I looked up.

For instance, let us take a link Talk:foo

I can explode this into Talk and foo.

Each namespace has an "urlprefix". For Talk, the urlprefix would be http://www.wikipedia.org/Talk

Now, let's add a (separate) language into the mix. en:Talk:foo

I can explode it into en, Talk, and foo.

Now, the urlprefix for Talk is the same. How do I say that it is an english language page? The normal, standard way to do this would be like so:

http://www.wikipedia.org/en/Talk/foo

That feels "right" to me. But doing that would require parsing the urlprefix for the namespace to figure out where to put in the language. I don't want to do that, and don't feel I should have to.

If languages are namespaces it is easy: I can make the namespace en_Talk have an urlprefix of http://www.wikipedia.org/en/Talk
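A minimal sketch of that lookup-and-append idea, in Python. The `URL_PREFIX` table and the `link_to_url` helper are hypothetical names for illustration, and the sketch assumes every link carries at least one prefix and that the title itself contains no ':'.

```python
# Hypothetical urlprefix table: one row per combined language_namespace key.
URL_PREFIX = {
    "Talk": "http://www.wikipedia.org/Talk",
    "en_Talk": "http://www.wikipedia.org/en/Talk",
}


def link_to_url(link: str) -> str:
    """Explode a wikilink on ':' and build the URL by lookup and append only."""
    *prefixes, title = link.split(":")
    key = "_".join(prefixes)  # "en:Talk:foo" -> "en_Talk"
    # No parsing of the looked-up value: just append the title to it.
    return URL_PREFIX[key] + "/" + title


print(link_to_url("Talk:foo"))
print(link_to_url("en:Talk:foo"))
```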

However, Brion, you've made a convincing case that we do need to know the language. I have no problem having another field for each namespace, giving that namespace's language to give to browsers.

Can you think of a better way to do this?

Jonathan

-- Geek House Productions, Ltd. Providing Unix & Internet Contracting and Consulting, QA Testing, Technical Documentation, Systems Design & Implementation, General Programming, E-commerce, Web & Mail Services since 1998 Phone: 604-435-1205 Email: djw@reactor-core.org Webpage: http://reactor-core.org Address: 2459 E 41st Ave, Vancouver, BC V5R2W2

The Cunctator

9 Dec

3:33 a.m.

On Mon, 2002-12-09 at 03:53, Jonathan Walther wrote:

...

I have a goal for the design. My goal is that, once a Wikilink is exploded based on the ':' character, I can do simple database lookups, and append the results together to get the target URL without having to do any parsing on the data I looked up.

For instance, let us take a link Talk:foo

I can explode this into Talk and foo.

Each namespace has an "urlprefix". For Talk, the urlprefix would be http://www.wikipedia.org/Talk

Now, let's add a (separate) language into the mix. en:Talk:foo

I can explode it into en, Talk, and foo.

Now, the urlprefix for Talk is the same. How do I say that it is an english language page? The normal, standard way to do this would be like so:

http://www.wikipedia.org/en/Talk/foo

That feels "right" to me. But doing that would require parsing the urlprefix for the namespace to figure out where to put in the language. I don't want to do that, and don't feel I should have to.

You're also making the slash back into a magic character, which was part of the whole problem. If someone creates a page named [[Talk/Listen]] for some reason (say a film comes out with that title), then your scheme would cause a collision.

Brion Vibber

3:39 a.m.

On Mon, 2002-12-09 at 00:53, Jonathan Walther wrote: [..]

...

Now, the urlprefix for Talk is the same. How do I say that it is an english language page? The normal, standard way to do this would be like so:

http://www.wikipedia.org/en/Talk/foo

Hmm, I'm not a big fan of the slashes between namespace and title; slashes imply a hierarchical path-like structure which certainly isn't there for talk pages.

...

That feels "right" to me. But doing that would require parsing the urlprefix for the namespace to figure out where to put in the language. I don't want to do that, and don't feel I should have to.

If you must do it that way, keying the url prefix table on language *and* namespace should do the trick.
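A rough sketch of what a prefix table keyed on both fields could look like, in Python. The table contents and the `url_for` helper are made up for illustration; only the `http://www.wikipedia.org/en/Talk` prefix comes from the thread.

```python
# Hypothetical url prefix table keyed on (language, namespace) together,
# so the language never has to be spliced into a stored prefix.
URL_PREFIX = {
    ("en", "Talk"): "http://www.wikipedia.org/en/Talk",
}


def url_for(language: str, namespace: str, title: str) -> str:
    """Look up the prefix by the composite key and append the title."""
    return URL_PREFIX[(language, namespace)] + "/" + title


print(url_for("en", "Talk", "foo"))
```

Language and namespace stay separate values here, so either one can still be inspected on its own.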

...

If languages are namespaces it is easy: I can make the namespace en_Talk have an urlprefix of http://www.wikipedia.org/en/Talk

It's unclear to me what it would entail to 'make the namespace en_Talk'. The link syntax is [[Talk:Foo]] (if in the en language section) or [[en:Talk:Foo]] (fully qualified or from another section), and you're talking about a URL that is named differently. Is the 'en_' prefix just a shorthand for squishing language and namespace into one database field?

...

However, Brion, you've made a convincing case that we do need to know the language. I have no problem having another field for each namespace, giving that namespace's language to give to browsers.

Can you think of a better way to do this?

How does your proposal mark namespaces by functionality? (ie, given a talk page or user page in some language, how do I know it's a talk page [and should be linked to a subject page] or a user page [and should be linked to a contribs list], etc)

-- brion vibber (brion @ pobox.com)

Jonathan Walther

3:57 a.m.

On Mon, Dec 09, 2002 at 01:39:17AM -0800, Brion Vibber wrote:

...

It's unclear to me what it would entail to 'make the namespace en_Talk'. The link syntax is [[Talk:Foo]] (if in the en language section) or [[en:Talk:Foo]] (fully qualified or from another section), and you're talking about a URL that is named differently. Is the 'en_' prefix just a shorthand for squishing language and namespace into one database field?

To answer your question: yes. But here is a question *I* should have asked ages ago: what constraints should be put on page titles? None whatsoever? You may not like letting namespaces be separated with a slash, but I believe most people do feel comfortable with the convention of making the language go first, and be separated with a slash.

Also, is Talk:en:Foo valid? I hope it isn't.

Can we abolish use of ':' in page titles so that it is totally reserved for languages and namespaces?

Jonathan

Brion Vibber

4:35 a.m.

On Mon, 2002-12-09 at 01:57, Jonathan Walther wrote:

...

On Mon, Dec 09, 2002 at 01:39:17AM -0800, Brion Vibber wrote:

...
It's unclear to me what it would entail to 'make the namespace en_Talk'. The link syntax is [[Talk:Foo]] (if in the en language section) or [[en:Talk:Foo]] (fully qualified or from another section), and you're talking about a URL that is named differently. Is the 'en_' prefix just a shorthand for squishing language and namespace into one database field?

To answer your question: yes.

Which raises the question of what purpose the tag serves? We still have to distinguish languages and classes of namespaces, which means either constantly parsing that tag into two pieces or constantly referencing a namespace table with entries like: "en_Talk" -> "en", "Talk", NSCLASS_TALK
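For illustration, such a namespace table might be sketched like this in Python. `NamespaceInfo`, `NAMESPACES`, `is_talk` and the numeric value of `NSCLASS_TALK` are all assumptions; only the "en_Talk" entry itself is taken from the message.

```python
from typing import NamedTuple

NSCLASS_TALK = 1  # arbitrary placeholder value for the "talk" class


class NamespaceInfo(NamedTuple):
    language: str
    name: str
    nsclass: int


# Hypothetical table: the combined tag is only a key, so every question
# about language or function still costs one lookup here.
NAMESPACES = {
    "en_Talk": NamespaceInfo("en", "Talk", NSCLASS_TALK),
}


def is_talk(tag: str) -> bool:
    """Report whether a combined tag names a talk namespace."""
    return NAMESPACES[tag].nsclass == NSCLASS_TALK


print(NAMESPACES["en_Talk"].language, is_talk("en_Talk"))
```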

...

But here is a question *I* should have asked ages ago: what constraints should be put on page titles? None whatsoever?

At present the following characters are allowed in titles: [-,.()' &;%!?_0-9A-Za-z/:\x80-\xFF]

This means the following ASCII chars are _not_ allowed:
\x00-\x1f (control codes, must not use)
" double quote
# hash
$ dollar
* asterisk
+ plus
< less than
= equals
> greater than
[ ] brackets (reserved for links, must not use)
\ backslash
^ caret
{ } curly braces
| pipe (reserved for links, must not use)
~ tilde

Some of these are illegal or special in URLs, but they can of course be hex-encoded as high characters and ampersands are.

(And of course underscore and space are folded together; internally underscore is used, and preferred in URLs; spaces are used for display and preferred in text and wikilinks.)
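A small sketch of how the allowed-character set and the underscore/space folding could be checked, in Python. The character class is copied from the message; the helper names are hypothetical, and matching against UTF-8 bytes is an assumption, since the thread does not say which encoding the \x80-\xFF range refers to.

```python
import re

# Character class copied from the message, applied to bytes so that
# \x80-\xFF covers high bytes (UTF-8 is assumed here, not confirmed).
TITLE_RE = re.compile(rb"[-,.()' &;%!?_0-9A-Za-z/:\x80-\xFF]+")


def is_legal_title(title: str) -> bool:
    """True if every byte of the title is in the allowed set."""
    return TITLE_RE.fullmatch(title.encode("utf-8")) is not None


def to_internal(title: str) -> str:
    """Fold spaces to underscores, the form used internally and in URLs."""
    return title.replace(" ", "_")


def to_display(title: str) -> str:
    """Fold underscores back to spaces for display and wikilinks."""
    return title.replace("_", " ")


print(is_legal_title("2001: A Space Odyssey"))  # colon and space are allowed
print(is_legal_title("Foo|Bar"))                # pipe is reserved for links
print(to_internal("Three Colors: Blue"))
```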

...

You may not like letting namespaces be separated with a slash, but I believe most people do feel comfortable with the convention of making the language go first, and be separated with a slash.

Heck, I suggested that myself: http://meta.wikipedia.org/wiki/Thoughts_on_language_integration

...

Also, is Talk:en:Foo valid? I hope it isn't.

It might pass the parser at the moment, but it probably shouldn't.

...

Can we abolish use of ':' in page titles so that it is totally reserved for languages and namespaces?

We explicitly decided to enable it a few months ago because it's very common in titles of works -- [[Three Colors: Blue]] [[2001: A Space Odyssey]] etc

-- brion vibber (brion @ pobox.com)

Andre Engels

6:08 a.m.

...

At present the following characters are allowed in titles: [-,.()' &;%!?_0-9A-Za-z/:\x80-\xFF]

Please remove '&' from this list: links to a page with '&' in it do not go right; they go to just the part before the '&'. So a page with '&' in the title is for all practical purposes unreachable.

Andre Engels

Mark Wojtowicz

10:28 a.m.

Andre Engels scribed:

...

...
At present the following characters are allowed in titles: [-,.()' &;%!?_0-9A-Za-z/:\x80-\xFF]

Please remove '&' from this list: links to a page with '&' in it do not go right; they go to just the part before the '&'. So a page with '&' in the title is for all practical purposes unreachable.

When it's in the form http://www.wikipedia.org/w/wiki.phtml?title=AT&T '&' doesn't work.

But these work: http://www.wikipedia.org/wiki/AT&T http://www.wikipedia.org/w/wiki.phtml?title=AT%26T
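The difference between the failing and the working query-string forms can be reproduced with a short Python sketch. Python's `urllib.parse` is used here only as a stand-in for whatever the wiki software does when it reads the query string.

```python
from urllib.parse import parse_qs, quote

# An unescaped '&' is read as a parameter separator, so the title is cut short.
broken = parse_qs("title=AT&T")["title"][0]

# Percent-encoding the '&' as %26 keeps it inside the title value.
fixed = parse_qs("title=" + quote("AT&T", safe=""))["title"][0]

print(broken)  # AT
print(fixed)   # AT&T
```

In the /wiki/AT&T form the '&' sits in the path rather than the query string, so it is not treated as a separator there.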

-- Mark

Jonathan Walther

10:27 a.m.

On Mon, Dec 09, 2002 at 11:28:38AM -0500, Mark Wojtowicz wrote:

...

Andre Engels scribed:

...
...
At present the following characters are allowed in titles: [-,.()' &;%!?_0-9A-Za-z/:\x80-\xFF]

If ? is in that list, how can we support the following?

http://www.wikipedia.org/wiki/foo?lang=en http://www.wikipedia.org/wiki/foo?lang=en&action=edit

How do we know that the title of the page is not foo?lang=en instead of being "foo", and having the argument lang=en applied to it?

I really don't like the ':' separator in the URL section for namespaces, because it shows up as a hex encoded string, instead of ':' in my urlbar.

Can we find some other syntax (not necessarily syntax within the [[]]) for accessing pages and specifying their namespace in the URL?

Jonathan

Magnus Manske

11:04 a.m.

Jonathan Walther wrote:

...

I really don't like the ':' separator in the URL section for namespaces, because it shows up as a hex encoded string, instead of ':' in my urlbar.

Can we find some other syntax (not necessarily syntax within the [[]]) for accessing pages and specifying their namespace in the URL?

Removing the : from the wikipedia system would not only mean all wikipedians having to learn new syntax, but also to move and redirect a zillion entries, not to speak of fixing these links on countless pages.

And even if we just make some "URL wrapper" for namespaces, I don't see the advantage. If you want to unite all language databases in a single one, why change the syntax? Just add an extra field to the table. "en:talk:stuff" will do just fine IMHO.

Magnus

Jonathan Walther

11:16 a.m.

On Mon, Dec 09, 2002 at 06:04:58PM +0100, Magnus Manske wrote:

...

Removing the : from the wikipedia system would not only mean all wikipedians having to learn new syntax, but also to move and redirect a zillion entries, not to speak of fixing these links on countless pages.

And even if we just make some "URL wrapper" for namespaces, I don't see the advantage. If you want to unite all language databases in a single one, why change the syntax? Just add an extra field to the table. "en:talk:stuff" will do just fine IMHO.

1) I want to see some sort of specification of what is allowed for language names, and namespace names.

2) I want to see how languages, namespaces, and article titles are supposed to map onto URI's that are NOT ambiguous.

3) Maybe I wasn't clear enough; the Wiki syntax wouldn't change; what I want is a way to map all links onto an URL without ambiguity.

Jonathan

Brion Vibber

3:46 p.m.

On Mon, 2002-12-09 at 08:27, Jonathan Walther wrote:

...

If ? is in that list, how can we support the following?

http://www.wikipedia.org/wiki/foo?lang=en http://www.wikipedia.org/wiki/foo?lang=en&action=edit

(At the moment, that wouldn't work, as we overwrite the query string when using the /wiki/foo alias; you'd just get /w/wiki.phtml?title=foo. However, it probably should work and it's easy to tweak the apache config for it.)

...

How do we know that the title of the page is not foo?lang=en instead of being "foo", and having the argument lang=en applied to it?

Because the page [[foo?lang=en]] would be http://www.wikipedia.org/wiki/foo%3Flang%3Den
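A short Python sketch of that distinction. The `page_url` helper is hypothetical, and encoding every reserved character with `safe=""` is a choice made for the sketch, not necessarily what the wiki does.

```python
from urllib.parse import quote, urlsplit


def page_url(title: str) -> str:
    """Build a /wiki/ URL with the whole title percent-encoded."""
    return "http://www.wikipedia.org/wiki/" + quote(title, safe="")


# The page whose title is literally "foo?lang=en": no query string at all.
titled = urlsplit(page_url("foo?lang=en"))
print(titled.path, repr(titled.query))

# The page "foo" with the argument lang=en applied to it.
argued = urlsplit("http://www.wikipedia.org/wiki/foo?lang=en")
print(argued.path, repr(argued.query))
```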

...

I really don't like the ':' separator in the URL section for namespaces, because it shows up as a hex encoded string, instead of ':' in my urlbar.

Your browser, like Mozilla, has a greater affinity for the hex codes and likes to display them even when it doesn't have to. What can I say.

...

Can we find some other syntax (not necessarily syntax within the [[]]) for accessing pages and specifying their namespace in the URL?

Why?

-- brion vibber (brion @ pobox.com)

Toby Bartels

12:10 p.m.

Clutch wrote:

...

For instance, let us take a link Talk:foo I can explode this into Talk and foo. Each namespace has an "urlprefix". For Talk, the urlprefix would be http://www.wikipedia.org/Talk

...

Now, let's add a (separate) language into the mix. en:Talk:foo I can explode it into en, Talk, and foo. Now, the urlprefix for Talk is the same. How do I say that it is an english language page? The normal, standard way to do this would be like so: http://www.wikipedia.org/en/Talk/foo That feels "right" to me. But doing that would require parsing the urlprefix for the namespace to figure out where to put in the language. I don't want to do that, and don't feel I should have to.

You have to parse the namespace anyway, to see if it's really a namespace. Remember, colons are perfectly acceptable in article titles, and they *don't* indicate namespaces *or* languages -- except in a few special cases. (Consider [[en:E. coli O157:H7]].)

So in order to parse a link correctly, we need to do these steps:
* Decide which language it is:
  * Is there a colon?
    Y * Take the string up to the first colon;
      * Is this string a language code?
        Y * Drop this bit from the text of the link;
          * That string indicates the language.
        N * It's the current language.
    N * It's the current language.
* Decide which namespace it is:
  * Is there a colon now?
    Y * Take the string up to the first colon;
      * Is this string a namespace *in*the*relevant*language*?
        Y * Drop this bit from the text of the link;
          * That string indicates the namespace.
        N * It's the main namespace.
    N * It's the main namespace.
* Decide what the page title is:
  * Take the rest of the string;
  * That's the title.

The algorithm is a bit more complicated than this, because of:
* Special meanings when a colon begins the link.
* The pipe trick (and its use of both namespaces and parentheses).
* Error handling (when the link has forbidden characters).
But the above 3 steps must all be done, and in that order, or we'll break functionality of certain links like the E coli one above.
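The three steps above can be sketched in Python roughly as follows. The `LANGUAGES` and `NAMESPACES` tables are invented for illustration, exact-case matching is assumed, and the leading-colon, pipe-trick and error-handling cases are left out.

```python
# Hypothetical tables; the thread does not list the real language codes
# or the per-language namespace names.
LANGUAGES = {"en", "de"}
NAMESPACES = {
    "en": {"Talk", "User"},
    "de": {"Diskussion", "Benutzer"},
}


def parse_link(link: str, current_language: str = "en"):
    """Split a wikilink into (language, namespace, title), in that order."""
    # Step 1: the text before the first colon counts as a language only
    # if it is a known language code.
    language = current_language
    head, colon, rest = link.partition(":")
    if colon and head in LANGUAGES:
        language, link = head, rest

    # Step 2: the text before the next colon counts as a namespace only
    # if it is a namespace in the language chosen in step 1.
    namespace = ""  # empty string stands for the main namespace
    head, colon, rest = link.partition(":")
    if colon and head in NAMESPACES.get(language, set()):
        namespace, link = head, rest

    # Step 3: whatever is left is the title, colons included.
    return language, namespace, link


print(parse_link("en:E. coli O157:H7"))
print(parse_link("Talk:foo"))
print(parse_link("2001: A Space Odyssey"))
```

Because both checks are table lookups rather than blind splits, the colon in "O157:H7" stays part of the title.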

-- Toby

wikitech-l@lists.wikimedia.org

11 comments

7 participants

participants (7)

Andre Engels
Brion Vibber
Jonathan Walther
Magnus Manske
Mark Wojtowicz
The Cunctator
Toby Bartels