I'm going to begin working on the following bugs:
* "Support collation by a certain locale (sorting order of characters)", https://bugzilla.wikimedia.org/show_bug.cgi?id=164 (only parts related to category sorting) * "Subcategory paging is not separate from article or image paging", https://bugzilla.wikimedia.org/show_bug.cgi?id=1211 * "CategoryTree is inefficient", https://bugzilla.wikimedia.org/show_bug.cgi?id=23682
As well as possibly:
* "Categories need to be structured by namespace", https://bugzilla.wikimedia.org/show_bug.cgi?id=450 * "Natural number sorting in category listings", https://bugzilla.wikimedia.org/show_bug.cgi?id=6948
There are essentially two problems here:
1) We currently sort articles on category pages by the Unicode code point of their sort key. This is terrible for anything other than English, and dodgy sometimes even for English. (This is bugs 164 and 6948.)
2) We have no way to efficiently get all items that are in a category and also in a particular namespace. Particularly, we can't retrieve all subcategories without scanning all items in the category, which is inefficient when we have a few (or no) subcategories and tons of items. (This is bugs 1211, 23682, and 450.)
One part of (2) needs to be clarified. The primary use-case is obviously that we want to be able to count subcategories efficiently, or display all of them when we only display some of the items in the category: this is bugs 1211 and 23682. Secondarily, we have a request at bug 450 to organize category pages by namespace, so main, Talk:, User:, etc. are all paginated separately.
I think the goal for (2) should be to allow efficient separate retrieval of subcategories, files, and other pages, but not to distinguish between namespaces otherwise. The major motivation is that to do this efficiently, we'll need to add namespace info to the categorylinks table, and we want this to stay consistent with the info in the page table. Categories, files, and other types of pages cannot be moved to one another, as far as I know (it would hardly make sense), so it automatically stays consistent this way. This is a big plus, because there are inevitably bugs that cause denormalized data to fall out of sync (look at cat_pages).
Furthermore, I don't think it's obvious that we want separate namespaces to display separately at all on category pages. What's a case where that would be desired? It would break up the display a lot, with a bunch of separate headers for different namespaces, when each namespace might only have a few items. Most categories whose sort appearance you'd care about (i.e., excepting maintenance categories) will have nearly everything in one namespace anyway. You could always split the category into separate ones per namespace if you want them separate.
So I propose that we keep the current category/normal page/file split, and paginate those three parts of the page separately. So you'd have up to 200 subcategories, then below that up to 200 normal pages, then below that up to 200 files. (The numbers could be adjusted. Currently they're hardcoded, which is stupid.) Paginating subcategories separately is obviously needed. Paginating files separately is not really needed, but it would be much more consistent.
The overall solution, then, would be:
1) Change the way category sortkeys are generated. Start them with a letter depending on namespace, like 'C' for category, 'P' for regular page, 'F' for file. After that first letter, append a sortkey generated by ICU or whatever. I think Tim has opinions on what would be a good choice to convert the article title into sort key -- if not, I'll have to research it and hopefully not come up with a completely incorrect answer.
2) On category pages, maintain three offsets and do three queries (or maybe UNION them together, doesn't matter), one for each of categories/regular pages/files. Because of (1), this will be efficient and will also sort less unreasonably for non-English languages.
One problem that was pointed out somewhere in the massive useless discussion on bug 164 is that we'd have to do something to display the first letter for each section. Currently it's just the first letter of the sortkey, but if that's some binary string, that becomes a problem. I'm not seeing an obvious solution, since the sortkey-generation algorithm will be opaque to us. If it sorts Á the same as A, then how do we figure out that the "canonical" first letter for the section should be "A" and not "Á"? How do we even figure out where the sections begin or end? Would that even make sense in all cases? At a first pass, I'd say we should just skip the first letter and display all the items straight from beginning to end without section divisions. I don't think that's a big problem.
This is just my initial thoughts. Feedback appreciated. If people agree with the general approach, I can start coding this up tomorrow.