How long can a title be? Database issue.

Update of fr.wikinews interwiki...

Richard Holton

2 Mar 2005 2 Mar '05

11:35 p.m.

In looking through the MediaWiki schema (both current and new), I've noticed that page titles are given a max length of 255 chars. However, it seems that in some cases, this title includes the namespace, and in others it does not -- namespace is stored as an Int or is implied by context (eg, categorylinks, imagelinks).

This sounds a warning klaxon in my head. Is my concern misplaced? Would it make sense to always use the namespace (as an int) plus title (without namespace) to reference a page (again, unless implied by context)?

Rich Holton en.wikipedia:User:Rholton

Show replies by date

Rowan Collins

3 Mar 3 Mar

12:14 a.m.

On Wed, 2 Mar 2005 13:35:06 -0600, Richard Holton richholton@gmail.com wrote:

...

In looking through the MediaWiki schema (both current and new), I've noticed that page titles are given a max length of 255 chars. However, it seems that in some cases, this title includes the namespace, and in others it does not -- namespace is stored as an Int or is implied by context (eg, categorylinks, imagelinks).

Hm, I see what you mean - there aren't that many places where it's a problem, but certainly 'brokenlinks' has the namespace as part of the [destination] title. So it seems an article could have 255 characters + a namespace (because the namespace isn't considered part of the title) and not fit in brokenlinks (because that just stores the text of the link, rather than a namespace and title).

There's been talk of merging the various links tables to all be id->name (rather than some being id->id), because the text of the link doesn't change, but the article it refers to might. This problem could be addressed by anyone implementing that.

The disadvantage of using {namespace_as_int, title_as_text} for link targets is that this doesn't reflect how they're entered: [[Foo:Bar]] could change in meaning from {0, "Foo:Bar"} to {20, "Bar"} if a custom "Foo" namespace was created; the two forms could not, however, co-exist. This suggests to me that it would be better to just make the link_to field wider than page_title (i.e. a width of 255 + a constant MAX_NAMESPACE_LENGTH), and retain the current practice of storing the destination as one string.

-- Rowan Collins BSc [IMSoP]

Gerard Meijssen

2 a.m.

Rowan Collins wrote:

...

On Wed, 2 Mar 2005 13:35:06 -0600, Richard Holton richholton@gmail.com wrote:

...
In looking through the MediaWiki schema (both current and new), I've noticed that page titles are given a max length of 255 chars. However, it seems that in some cases, this title includes the namespace, and in others it does not -- namespace is stored as an Int or is implied by context (eg, categorylinks, imagelinks).

Hm, I see what you mean - there aren't that many places where it's a problem, but certainly 'brokenlinks' has the namespace as part of the [destination] title. So it seems an article could have 255 characters

a namespace (because the namespace isn't considered part of the

title) and not fit in brokenlinks (because that just stores the text of the link, rather than a namespace and title).

There's been talk of merging the various links tables to all be id->name (rather than some being id->id), because the text of the link doesn't change, but the article it refers to might. This problem could be addressed by anyone implementing that.

The disadvantage of using {namespace_as_int, title_as_text} for link targets is that this doesn't reflect how they're entered: [[Foo:Bar]] could change in meaning from {0, "Foo:Bar"} to {20, "Bar"} if a custom "Foo" namespace was created; the two forms could not, however, co-exist. This suggests to me that it would be better to just make the link_to field wider than page_title (i.e. a width of 255 + a constant MAX_NAMESPACE_LENGTH), and retain the current practice of storing the destination as one string.

Does the restriction for 250 characters apply to the number of character in the URL or in the number of characters that make up the name of the article eg http://commons.wikimedia.org/wiki/Image:Hy-%D5%94%D6%80%D5%AB%D5%BD%D5%BF%D5... which is Image:Hy-Քրիստոս ծնավ և հայտնեցավ, մեզի, ձեզի Սուրբ ավետիս.ogg

Thanks, GerardM

PS No i did not count them

Richard Holton

3:46 a.m.

On Wed, 02 Mar 2005 23:00:41 +0100, Gerard Meijssen gerard.meijssen@gmail.com wrote:

...

Does the restriction for 250 characters apply to the number of character in the URL or in the number of characters that make up the name of the article eg http://commons.wikimedia.org/wiki/Image:Hy-%D5%94%D6%80%D5%AB%D5%BD%D5%BF%D5... which is Image:Hy-Քրիստոս ծնավ և հայտնեցավ, մեզի, ձեզի Սուրբ ավետիս.ogg

As far as I can tell, it only applies to the article/page title, not the formed URL

-Rich Holton en.wikipedia:User:Rholton

Richard Holton

3:29 a.m.

On Wed, 2 Mar 2005 20:14:25 +0000, Rowan Collins rowan.collins@gmail.com wrote:

...

On Wed, 2 Mar 2005 13:35:06 -0600, Richard Holton richholton@gmail.com wrote:

...
In looking through the MediaWiki schema (both current and new), I've noticed that page titles are given a max length of 255 chars. However, it seems that in some cases, this title includes the namespace, and in others it does not -- namespace is stored as an Int or is implied by context (eg, categorylinks, imagelinks).

Hm, I see what you mean - there aren't that many places where it's a problem, but certainly 'brokenlinks' has the namespace as part of the [destination] title. So it seems an article could have 255 characters

a namespace (because the namespace isn't considered part of the

title) and not fit in brokenlinks (because that just stores the text of the link, rather than a namespace and title).

There's been talk of merging the various links tables to all be id->name (rather than some being id->id), because the text of the link doesn't change, but the article it refers to might. This problem could be addressed by anyone implementing that.

The disadvantage of using {namespace_as_int, title_as_text} for link targets is that this doesn't reflect how they're entered: [[Foo:Bar]] could change in meaning from {0, "Foo:Bar"} to {20, "Bar"} if a custom "Foo" namespace was created; the two forms could not, however, co-exist. This suggests to me that it would be better to just make the link_to field wider than page_title (i.e. a width of 255 + a constant MAX_NAMESPACE_LENGTH), and retain the current practice of storing the destination as one string.

I notice that in the new schema, the 'page' table uses the {namespace_as_int, title_as_text} form, and it doesn't save the namespace within the title. (Was that true of the old schema as well?)

I don't want to second-guess the new schema. It does seem that the link tables should use the same method of identifying pages as the 'page' table does.

For 'categorylinks', having the namespace in the index would allow fast separation by namespace.

-Rich Holton en.wikipedia:User:Rholton

Brion Vibber

4:48 a.m.

...

On Wed, 2 Mar 2005 20:14:25 +0000, Rowan Collins rowan.collins@gmail.com wrote:

...
The disadvantage of using {namespace_as_int, title_as_text} for link targets is that this doesn't reflect how they're entered: [[Foo:Bar]] could change in meaning from {0, "Foo:Bar"} to {20, "Bar"} if a custom "Foo" namespace was created; the two forms could not, however, co-exist.

Namespace creation is a rare event which requires administrative work. Filtering by namespace on the other hand happens constantly.

...

...
This suggests to me that it would be better to just make the link_to field wider than page_title (i.e. a width of 255 + a constant MAX_NAMESPACE_LENGTH), and retain the current practice of storing the destination as one string.

This would make it very difficult to use the links table for anything; for instance Special:Recentchangeslinked would no longer be possible.

Richard Holton wrote:

...

I notice that in the new schema, the 'page' table uses the {namespace_as_int, title_as_text} form, and it doesn't save the namespace within the title. (Was that true of the old schema as well?)

Yes. It's been that way for years.

...

I don't want to second-guess the new schema. It does seem that the link tables should use the same method of identifying pages as the 'page' table does.

For 'categorylinks', having the namespace in the index would allow fast separation by namespace.

Lemme summarize the situation:

We have four link tables currently: links, brokenlinks, imagelinks, and categorylinks.

links is from id->target id brokenlinks is from id->(text target namespace+title) imagelinks is from id->target title [namespace is 6 by definition] categorylinks is from id->target title [namespace is 14 by definition]

In all, the 'from' is a key on page_id (cur_id in old schema) which uniquely identifies the page doing the linking. This number persists across page renaming.

In imagelinks and categorylinks, the target title can be used in conjunction with the hardcoded namespace to join to page/cur for the target.

In brokenlinks, the target is ugly ugly text. This can't be used in any joins. It should be changed to (namespace,title) but we are too lazy and this hasn't been done yet. There is the additional problem that the size limit of the field isn't 100% correct so there might be inconsistencies with long titles.

In links, the target is a page_id/cur_id number, and can be used to do joins. BUT, since linking is done by *name*, not by number, a creation/deletion/renaming of the target page will break this entry. Thus we have to clean up links, and shuffle pages around between links and brokenlinks when these things happen.

This kind of updating can be a burden on the database during operations on heavily-linked pages, so it's something we scalability-conscious folk want to eliminate.

-- brion vibber (brion @ pobox.com)

Richard Holton

6:14 a.m.

On Wed, 02 Mar 2005 16:48:51 -0800, Brion Vibber brion@pobox.com wrote:

...

Lemme summarize the situation:

We have four link tables currently: links, brokenlinks, imagelinks, and categorylinks.

links is from id->target id brokenlinks is from id->(text target namespace+title) imagelinks is from id->target title [namespace is 6 by definition] categorylinks is from id->target title [namespace is 14 by definition]

In all, the 'from' is a key on page_id (cur_id in old schema) which uniquely identifies the page doing the linking. This number persists across page renaming.

In imagelinks and categorylinks, the target title can be used in conjunction with the hardcoded namespace to join to page/cur for the target.

In brokenlinks, the target is ugly ugly text. This can't be used in any joins. It should be changed to (namespace,title) but we are too lazy and this hasn't been done yet. There is the additional problem that the size limit of the field isn't 100% correct so there might be inconsistencies with long titles.

In links, the target is a page_id/cur_id number, and can be used to do joins. BUT, since linking is done by *name*, not by number, a creation/deletion/renaming of the target page will break this entry. Thus we have to clean up links, and shuffle pages around between links and brokenlinks when these things happen.

This kind of updating can be a burden on the database during operations on heavily-linked pages, so it's something we scalability-conscious folk want to eliminate.

Thanks for the clarity, Brion. If I understand things correctly, we should look at eventually changing 'links' and 'broken links' to use: from id->(to namespace, to title).

Would there be a need for separate tables then? Such a configuration would require a lookup on the page table for each link when rendering a page. Although, now that I think about it, that's probably required now anyway.

Categorylinks seems to be a bit of a special case, since it really amounts to a reverse link. Adding [[category:foo]] to page 'bar' effectively adds a link to 'bar' on category:foo. At the same time, there can be broken category links on a page, but not broken page links on the category listing.

At the moment, I'm thinking of adding a "fromNamespace" field to categorylinks. This will decouple the display of pages from the display of parent categories, and would facilitate breaking the category display into namespaces.

-Rich Holton en.wikipedia:User:Rholton

Rowan Collins

5:47 p.m.

On Wed, 2 Mar 2005 20:14:43 -0600, Richard Holton richholton@gmail.com wrote:

...

Thanks for the clarity, Brion.

Firstly, let me second that thanks. And thanks also for pointing out the relative unhelpfulness of storing the target as one text string. It now occurs to me that if a new namespace (or a new interwiki prefix) is created, you'd just have to search for {ns=0, title=/Foo:.*/} instead of {target=/Foo:.*/}, so it really makes no odds.

...

If I understand things correctly, we should look at eventually changing 'links' and 'broken links' to use: from id->(to namespace, to title).

Would there be a need for separate tables then?

No, almost certainly not. In fact, it would be very helpful to create a unified table, because then we wouldn't need the proliferation of task-specific tables. See, for instance, http://bugzilla.wikimedia.org/show_bug.cgi?id=1065

...

Such a configuration would require a lookup on the page table for each link when rendering a page. Although, now that I think about it, that's probably required now anyway.

Yes, in order to determine whether a link is "broken" or not, you have to look for the page in the cur/page table. The only difference is that right now it gets stored in a different table if its broken, rather than just having a link_is_broken flag which can be checked and toggled when necessary.

...

Categorylinks seems to be a bit of a special case, since it really amounts to a reverse link.

That's one way of thinking about it. The other, which is more in tune with less formalised wikis (e.g. http://c2.com/cgi/wiki?WikiCategories), is that the category page is just a souped-up "What links here" display, and the links on individual pages just internal links displayed out-of-line. On this view, they're just another kind of link - distinguished only to solve the "Use-mention problem" (http://www.usemod.com/cgi-bin/mb.pl?UseMentionProblem)

...

At the moment, I'm thinking of adding a "fromNamespace" field to categorylinks. This will decouple the display of pages from the display of parent categories, and would facilitate breaking the category display into namespaces.

Hm, that certainly does match your "reverse link" interpretation. Of course, it suffers from a variant of the update problem that the current id->id based links table does: a page can be moved from one namespace to another, taking its category links with it. So, if [[Foo]] contains [[Category:Metasyntactic variables]], and I move it to [[Wikipedia:Foo]], the from_namespace field in the categorylinks table would become wrong, even though the actual page reference (by ID) would remain valid. I don't know whether that update would be more or less sensible/expensive/complex than having to do an extra lookup/join/whatever to get the namespace out of the page table when displaying the category.

There's an additional problem with merging the categorylinks table in with the others anyway, in that it has extra fields cl_sortkey and cl_timestamp that the others don't have or need. That's a shame, because it would be neat, if redesigning the tables, not to have to leave that one out.

Some rambling thoughts, though, on a possible links schema: * l_from (as page_id) * l_to_ns (as int) * l_to_title (as string)

* l_is_broken (boolean) --- to allow everything that brokenlinks currently does; this should be kept seperate from the link type, because if you try to {{include}} a non-existent page, it's both "broken" and a "template" link, and both facts are potentially useful

* l_type (probably an int, like namespace) --- values could include: --- LT_NORMAL: just an ordinary [[free link]] (or an escaped link to a [[:Image:Page]] or [[:Category:Page]]) --- LT_IMAGE: it's an image display (l_to_ns would be redundant, but never mind) --- LT_TEMPLATE: it's a {{template inclusion}} --- LT_INTERWIKI: currently, interwiki links don't get stored anywhere, but it could be useful if they were; for instance, if someone changes the interwiki map, it might be worth looking up what links are affected. Obviously, l_to_ns would be meaningless, and any search for a specific prefix would involve matching the text, but might it be worth considering? --- LT_CATEGORY: as I say, this has rather awkward issues - where do the extra fields go? l_to_ns could be reused [abused] as the namespace of the *originating* page if keeping that up-to-date was considered better than looking it up in the page table

Using the links table for redirects --- this is another idea I had which may or may not have merit. At the moment, we can know a page is a redirect from one flag in the page table, but have to retrieve and parse its content to find what it's a redirect *to*. Given that this now requires a look-up in the revision table, it would be just as easy to look up the destination in the link table instead. The only times this wouldn't work would be redirects to Interwiki links (unless we start storing those) and redirects to Special pages (which are already kind of crazy, but certainly don't show up in the links table right now; they'd have to be their own link type if they did, I guess). You could always have a system that *fell back to* getting and parsing the revision, but maybe that would be kind of silly. --- I was originally going to suggest an 'l_is_redirect' field, but now I think about it, maybe we don't need it as long as a redirect page can only have one valid link on it. Every use I thought of (such as category redirects being treated as 'aliases' and effectively merging the two categories, which would make renaming a category a *lot* easier) has to access the page table anyway, to get the full text of the title from the id.

Right, sorry for the ramble, if anyone actually read this far, I'd be interested in your thoughts.

-- Rowan Collins BSc [IMSoP]

Richard Holton

8:05 p.m.

On Thu, 3 Mar 2005 13:47:27 +0000, Rowan Collins rowan.collins@gmail.com wrote:

...

On Wed, 2 Mar 2005 20:14:43 -0600, Richard Holton richholton@gmail.com wrote:

...
If I understand things correctly, we should look at eventually changing 'links' and 'broken links' to use: from id->(to namespace, to title).

Would there be a need for separate tables then?

No, almost certainly not. In fact, it would be very helpful to create a unified table, because then we wouldn't need the proliferation of task-specific tables. See, for instance, http://bugzilla.wikimedia.org/show_bug.cgi?id=1065

Thanks for the link to the bug entry. The discussion there helped me to understand some things a bit better.

...

...
Categorylinks seems to be a bit of a special case, since it really amounts to a reverse link.

That's one way of thinking about it. The other, which is more in tune with less formalised wikis (e.g. http://c2.com/cgi/wiki?WikiCategories), is that the category page is just a souped-up "What links here" display, and the links on individual pages just internal links displayed out-of-line. On this view, they're just another kind of link - distinguished only to solve the "Use-mention problem" (http://www.usemod.com/cgi-bin/mb.pl?UseMentionProblem)

Another way to think about things: a category link is like a two-way link. It links both from the page to the category, and from the category to the page. I definitely see the similarity between "what links here" and categories. In that sense, each page becomes its own category, with any link to the page a 'category entry'. Perhaps we could use this relationship to improve the 'what links here' display.

...

...
At the moment, I'm thinking of adding a "fromNamespace" field to categorylinks. This will decouple the display of pages from the display of parent categories, and would facilitate breaking the category display into namespaces.

Hm, that certainly does match your "reverse link" interpretation. Of course, it suffers from a variant of the update problem that the current id->id based links table does: a page can be moved from one namespace to another, taking its category links with it. So, if [[Foo]] contains [[Category:Metasyntactic variables]], and I move it to [[Wikipedia:Foo]], the from_namespace field in the categorylinks table would become wrong, even though the actual page reference (by ID) would remain valid. I don't know whether that update would be more or less sensible/expensive/complex than having to do an extra lookup/join/whatever to get the namespace out of the page table when displaying the category.

Yes, that is the trade off. However, when a category page moves, it only has to update its own categorylink entries -- one for each each [[category:xxx]] on that page. Note that currently, the cl_sortkey field is updated on any page move for those links that don't have a specified sort key.

If you add a category link ([[category:foo]] to a category page "bar", then "bar" becomes a sub-category of Foo. The link from category:bar to category:foo is entered into categorylinks just like any other. Currently, to find subcategories for a category page, the links have to be retrieved and separated by namespace in PHP. This is why the subcategory and other pages on the category page are linked alphabetically. A chunk of up to 200 links to the category are retrieved, then separated. If you wanted to get the next 200 sub-categories, you would have to keep reading in category links until you find 200 subcategories, or until you've read all the categorylinks, whichever comes first. That is why having a namespace field is important.

There are also requests for separating the category displays further into namespaces (e.g. listing templates separately). This would have the same issue as above.

...

There's an additional problem with merging the categorylinks table in with the others anyway, in that it has extra fields cl_sortkey and cl_timestamp that the others don't have or need. That's a shame, because it would be neat, if redesigning the tables, not to have to leave that one out.

By the way, cl_timestamp is currently unused (I believe). It can be handy for debugging, but otherwise I don't know it's intended purpose.

...

Some rambling thoughts, though, on a possible links schema:

l_from (as page_id)

l_to_ns (as int)

l_to_title (as string)

l_is_broken (boolean)

--- to allow everything that brokenlinks currently does; this should be kept seperate from the link type, because if you try to {{include}} a non-existent page, it's both "broken" and a "template" link, and both facts are potentially useful

I don't fully understand the use of a "broken" field. Does this eliminate the need for verifying page existence during rendering?

Maintaining "broken" requires a potentially large update for each page creation, deletion, or move.

...

l_type (probably an int, like namespace)

--- values could include: --- LT_NORMAL: just an ordinary [[free link]] (or an escaped link to a [[:Image:Page]] or [[:Category:Page]]) --- LT_IMAGE: it's an image display (l_to_ns would be redundant, but never mind) --- LT_TEMPLATE: it's a {{template inclusion}} --- LT_INTERWIKI: currently, interwiki links don't get stored anywhere, but it could be useful if they were; for instance, if someone changes the interwiki map, it might be worth looking up what links are affected. Obviously, l_to_ns would be meaningless, and any search for a specific prefix would involve matching the text, but might it be worth considering? --- LT_CATEGORY: as I say, this has rather awkward issues - where do the extra fields go? l_to_ns could be reused [abused] as the namespace of the *originating* page if keeping that up-to-date was considered better than looking it up in the page table

I want to think about your ideas on redirects, but I don't have time at the moment. I'll try to get back to you later.

-Rich Holton en.wikipedia:User:Rholton

Brion Vibber

9:06 p.m.

Richard Holton wrote:

...

I don't fully understand the use of a "broken" field. Does this eliminate the need for verifying page existence during rendering?

Maintaining "broken" requires a potentially large update for each page creation, deletion, or move.

A 'broken' field is not necessary and probably not desireable due to the cost of maintaining it. Currently we sometimes use links/brokenlinks to prefill the link existence cache when rendering, but then we cache that info back to another table (linkcache) anyway...

-- brion vibber (brion @ pobox.com)

Rowan Collins

9:28 p.m.

On Thu, 3 Mar 2005 10:05:15 -0600, Richard Holton richholton@gmail.com wrote:

...

Another way to think about things: a category link is like a two-way link. It links both from the page to the category, and from the category to the page.

...

I definitely see the similarity between "what links here" and categories. In that sense, each page becomes its own category, with any link to the page a 'category entry'.

Yes, that's the concept behind "WikiCategories" on simple wikis like UseMod - the backlinks are the member pages. e.g. http://www.usemod.com/cgi-bin/mb.pl?CategoryMeatball, the contents of which just tells you to look at its backlinks.

...

Perhaps we could use this relationship to improve the 'what links here' display.

That would indeed be incredibly useful - that page is hideously presented, has an arbitrary hard-coded cut-off and is unsorted. If we could use the same display style (or even some shared super-class) as the category listings, that would be a huge improvement.

It almost makes me wonder if all links should store a from_namespace and sortkey, so that we can do the same paging and display seperation tricks - but that would of course require massive updates on every page move, so probably isn't going to happen.

...

...
... Of course, it suffers from a variant of the update problem that the current id->id based links table does: a page can be moved from one namespace to another, taking its category links with it.

...

Yes, that is the trade off. However, when a category page moves, it only has to update its own categorylink entries -- one for each each [[category:xxx]] on that page.

It's not moving the *category* that's the main problem, but moving the *content*, since that could cross namespaces. I think you did get my point, it's just this sentence is either a typo or a non sequitur (or both) ;-)

...

Note that currently, the cl_sortkey field is updated on any page move for those links that don't have a specified sort key.

Ah, I didn't realise that; but I guess it makes sense - especially once you get into the listing needing to be split into pages.

...

Currently, to find subcategories for a category page, the links have to be retrieved and separated by namespace in PHP.

I'm not that hot on SQL (serves me right for not studying straight CS at uni), but isn't it possible to do something like "SELECT <stuff> FROM categorylinks, page WHERE cl_to=<whatever> AND page_namespace=NS_CATEGORY LIMIT 200", thus avoiding the problem you're describing?

I've seen things that look similar to this in the code, but it could be that 1) I have misunderstood, and this particular one's impossible 2) this would in fact be possible, but prohibitively expensive/inefficient, and therefore it's been discarded as an option 3) it would amount to exactly the same activity as doing it as a sequence of queries with PHP in between

...

...
There's an additional problem with merging the categorylinks table in with the others anyway, in that it has extra fields cl_sortkey and cl_timestamp that the others don't have or need. That's a shame, because it would be neat, if redesigning the tables, not to have to leave that one out.

By the way, cl_timestamp is currently unused (I believe). It can be handy for debugging, but otherwise I don't know it's intended purpose.

Hm. I thought maybe it was intended that additions and removals from a category could be presented in a history display of some sort, or on watchlists etc. But a timestamp would only allow listing of additions, so I'm not sure that can be right.

Still, unless there's a sensible way of reusing the sortkey field in other link-types, it's probably not a good idea to merge category links in with everything else. [To store the namespace_from, you could just reuse the otherwise redundant namespace_to field, but nothing springs to mind that could double up with the sortkey in this way]

...

...

l_is_broken (boolean)

--- to allow everything that brokenlinks currently does; this should be kept seperate from the link type, because if you try to {{include}} a non-existent page, it's both "broken" and a "template" link, and both facts are potentially useful

I don't fully understand the use of a "broken" field. Does this eliminate the need for verifying page existence during rendering?

Maintaining "broken" requires a potentially large update for each page creation, deletion, or move.

Well, there must be some reason we currently store "brokenlinks" separate from "links", right? I'm not sure the distinction is used when rendering, that seems to be done by looking at the page table, but there's an awful lot of deferral and caching, so maybe it is. [Or, alternatively, maybe it should be! I don't know.] It's used by utilities like Special:Wantedpages, I know that much. It certainly seems to make sense to store the distinction.

As for updates - yes, creation or deletion will have a big impact, but all the linking pages need their cache invalidating anyway, because they'll have the wrong colour links, so it's always going to be a big deal. And note that *moving* a page doesn't break any links - there's still a page at the old location, albeit a brand new redirect. Somebody might delete that redirect later, but that's a seperate action.

-- Rowan Collins BSc [IMSoP]

Richard Holton

4 Mar 4 Mar

5:15 a.m.

On Thu, 3 Mar 2005 17:28:50 +0000, Rowan Collins rowan.collins@gmail.com wrote:

...

On Thu, 3 Mar 2005 10:05:15 -0600, Richard Holton richholton@gmail.com wrote:

...
Yes, that is the trade off. However, when a category page moves, it only has to update its own categorylink entries -- one for each each [[category:xxx]] on that page.

It's not moving the *category* that's the main problem, but moving the *content*, since that could cross namespaces. I think you did get my point, it's just this sentence is either a typo or a non sequitur (or both) ;-)

Yes, it was a typo. I meant that when a page moves, it only has to update its own categorylink entries.

...

...
Note that currently, the cl_sortkey field is updated on any page move for those links that don't have a specified sort key.

Ah, I didn't realise that; but I guess it makes sense - especially once you get into the listing needing to be split into pages.

...
Currently, to find subcategories for a category page, the links have to be retrieved and separated by namespace in PHP.

I'm not that hot on SQL (serves me right for not studying straight CS at uni), but isn't it possible to do something like "SELECT <stuff> FROM categorylinks, page WHERE cl_to=<whatever> AND page_namespace=NS_CATEGORY LIMIT 200", thus avoiding the problem you're describing?

I've seen things that look similar to this in the code, but it could be that

I have misunderstood, and this particular one's impossible

this would in fact be possible, but prohibitively

expensive/inefficient, and therefore it's been discarded as an option 3) it would amount to exactly the same activity as doing it as a sequence of queries with PHP in between

Yes, I realized the same thing about 20 minutes after I posted my message. Funny how you can fail to see something until you actually put things to print. The namespace can be grabbed from the page table via a join. I _don't_ know what sort of inefficiencies are introduced by doing so.

I was thinking of having an index on (cl_to, cl_from_namespace, cl_sortkey) to have really speedy category page builds.

If I remember my CS courses correctly (now many years old), proper database design would have us not duplicate the namespace field. However, efficiency does sometimes trump theory.

...

...
...
There's an additional problem with merging the categorylinks table in with the others anyway, in that it has extra fields cl_sortkey and cl_timestamp that the others don't have or need. That's a shame, because it would be neat, if redesigning the tables, not to have to leave that one out.

By the way, cl_timestamp is currently unused (I believe). It can be handy for debugging, but otherwise I don't know it's intended purpose.

Hm. I thought maybe it was intended that additions and removals from a category could be presented in a history display of some sort, or on watchlists etc. But a timestamp would only allow listing of additions, so I'm not sure that can be right.

Still, unless there's a sensible way of reusing the sortkey field in other link-types, it's probably not a good idea to merge category links in with everything else. [To store the namespace_from, you could just reuse the otherwise redundant namespace_to field, but nothing springs to mind that could double up with the sortkey in this way]

I'm not thrilled with the idea of fudging the namespace_from data into the namespace_to field., though I certainly could live with it. However, I also cannot figure out any use for the cl_sortkey field for non-category links. We're probably better off not trying to force-fit things.

...

...
...

l_is_broken (boolean)

--- to allow everything that brokenlinks currently does; this should be kept seperate from the link type, because if you try to {{include}} a non-existent page, it's both "broken" and a "template" link, and both facts are potentially useful

I don't fully understand the use of a "broken" field. Does this eliminate the need for verifying page existence during rendering?

Maintaining "broken" requires a potentially large update for each page creation, deletion, or move.

Well, there must be some reason we currently store "brokenlinks" separate from "links", right? I'm not sure the distinction is used when rendering, that seems to be done by looking at the page table, but there's an awful lot of deferral and caching, so maybe it is. [Or, alternatively, maybe it should be! I don't know.] It's used by utilities like Special:Wantedpages, I know that much. It certainly seems to make sense to store the distinction.

As for updates - yes, creation or deletion will have a big impact, but all the linking pages need their cache invalidating anyway, because they'll have the wrong colour links, so it's always going to be a big deal. And note that *moving* a page doesn't break any links - there's still a page at the old location, albeit a brand new redirect. Somebody might delete that redirect later, but that's a seperate action.

Yes, but if a page is moved to a previously unoccupied location, then you need to check if some previously broken links are now not broken.

-Rich Holton en.wikipedia:User:Rholton

7247

Age (days ago)

7249

Last active (days ago)

wikitech-l@lists.wikimedia.org

11 comments

4 participants

tags (0)

participants (4)

Brion Vibber
Gerard Meijssen
Richard Holton
Rowan Collins