On Thu, May 14, 2009 at 1:48 PM, Marcus Buck <wiki(a)marcusbuck.org> wrote:
If I continue my plain layman approach, I would add a
new unique column
"page_sortkey" to the table "page", that consists of the
(uppercased,
str_replaced as described in my previous post, and with whatever
replacements else necessary to make it sort properly) pagename, " "
(two whitespaces) as separator, and the page_id of the page. The page_id
adds uniqueness, the two whitespaces make short words always sort before
long words, as " " is the first character byte-order-wise and two
whitespaces can never occur in page titles.
Special:Allpages would then "ORDER BY 'page_sortkey'" instead of
"ORDER
BY 'page_title'". If the (language-specific) function that converts the
pagename to the sortkey string is good, that should make Allpages sort
properly. The only remaining problem would be that pairs like "Günther"
and "Gunther" would always sort according to the page_id instead of
first "Gunther" and then "Günther". Could perhaps be solved by also
making the strpos of the first character that was replaced by our
function part of the sortkey, but I guess that would be over the top ;-)
But most likely I missed some problem, that makes this not work.
You don't need to append the page id redundantly to the column -- you
could just append the actual page_id column to the index, and sort by
it as well. It saves you a few bytes. We would be doing this already
for category pagination if Pager actually supported it.
Other than that, a problem with this is that page titles aren't just
in page, they're scattered all over the place. We use (namespace,
title) pairs for pagelinks, templatelinks, recentchanges, watchlist,
logging, redirect, protected_titles, among others, and probably some
extension tables as well. (This isn't necessarily denormalization --
most of those references are supposed to point to the old name if the
page is moved, which the page ID wouldn't.)
Some of these tables probably have sorts by title done on them in the
UI somewhere. It would be pretty scary for us to add an extra column
for each one *and keep it updated*. Some of these tables can have
many thousands of rows for a single page, and each row would need to
be updated whenever the page's sort key changes. A change to the
sorting algorithm would require changing every row of some fairly
large tables (like logging) that currently are more or less
insert-only. Of course, the same would be true for just
categorylinks, so I guess it's not such a big deal.
We do sorts by page title on at least some of these. For instance, a
sorted list of watchlist pages:
http://en.wikipedia.org/wiki/Special:Watchlist/edit
Admittedly, I can't think of many situations where it actually
*matters*, or where people use the sorted list much, other than
categories and perhaps page table stuff. So maybe categorylinks and
page would be sufficient after all.