On Thu, May 14, 2009 at 1:48 PM, Marcus Buck wiki@marcusbuck.org wrote:
If I continue my plain layman approach, I would add a new unique column "page_sortkey" to the table "page", that consists of the (uppercased, str_replaced as described in my previous post, and with whatever replacements else necessary to make it sort properly) pagename, " " (two whitespaces) as separator, and the page_id of the page. The page_id adds uniqueness, the two whitespaces make short words always sort before long words, as " " is the first character byte-order-wise and two whitespaces can never occur in page titles.
Special:Allpages would then "ORDER BY 'page_sortkey'" instead of "ORDER BY 'page_title'". If the (language-specific) function that converts the pagename to the sortkey string is good, that should make Allpages sort properly. The only remaining problem would be that pairs like "Günther" and "Gunther" would always sort according to the page_id instead of first "Gunther" and then "Günther". Could perhaps be solved by also making the strpos of the first character that was replaced by our function part of the sortkey, but I guess that would be over the top ;-) But most likely I missed some problem, that makes this not work.
You don't need to append the page id redundantly to the column -- you could just append the actual page_id column to the index, and sort by it as well. It saves you a few bytes. We would be doing this already for category pagination if Pager actually supported it.
Other than that, a problem with this is that page titles aren't just in page, they're scattered all over the place. We use (namespace, title) pairs for pagelinks, templatelinks, recentchanges, watchlist, logging, redirect, protected_titles, among others, and probably some extension tables as well. (This isn't necessarily denormalization -- most of those references are supposed to point to the old name if the page is moved, which the page ID wouldn't.)
Some of these tables probably have sorts by title done on them in the UI somewhere. It would be pretty scary for us to add an extra column for each one *and keep it updated*. Some of these tables can have many thousands of rows for a single page, and each row would need to be updated whenever the page's sort key changes. A change to the sorting algorithm would require changing every row of some fairly large tables (like logging) that currently are more or less insert-only. Of course, the same would be true for just categorylinks, so I guess it's not such a big deal.
We do sorts by page title on at least some of these. For instance, a sorted list of watchlist pages:
http://en.wikipedia.org/wiki/Special:Watchlist/edit
Admittedly, I can't think of many situations where it actually *matters*, or where people use the sorted list much, other than categories and perhaps page table stuff. So maybe categorylinks and page would be sufficient after all.