Aryeh Gregor <Simetrical+wikilist <at> gmail.com> writes:
On Thu, May 14, 2009 at 10:34 AM, Marcus Buck <wiki <at> marcusbuck.org> wrote:
Take the pagename and make it uppercase (could be lowercase too, but uppercase seems better as the first letter will show up in the category). str_replace "Ä" with "A", "Ö" with "O", "Ü" with "U" and "ß" with "SS". Also str_replace other Latin characters with diacritics with their counterpart without diacritic. And that's our sortkey. This very simple procedure should reduce the number of necessary defaultsorts (except for articles about persons) by about 90% in the German wikipedia.
This would absolutely be possible as a "mostly works" solution for category sorting. It would mostly just need to have the appropriate code written.
Most of that can be done with one single language-independent algorithm. All the collation rules I've seen so far fall into one of four categories:
1. you just need to transform to lowercase and discard diacritics (and space, punctuation etc.). That is, you can do a Unicode decomposition and then throw out everything from the combining ranges. I think there are very few languages where that would be fully correct (for example it doesn't handle the German ß), but it would make sorting a lot less wrong, at least for Latin scripts. (For example, orr < őr < ott is incorrect in Hungarian as ő should be sorted after o, but it is still a lot better than putting ő way down after z.) And it only has to be written once.
2. you need a translation table with string replacement rules like ö => o, ő => o~, ß => ss. This works for most languages with Latin letters and probably a lot of others. It needs per-language rules, but it is much easier to ask language communities to provide translation rules than to ask them to write sorting code (and then review it). Most wikis have probably already developed those rules and use them with DEFAULTSORT.
3. you need a multipass replacement with multiple translation tables (and you concatenate the result using some sort of separator character). Theoretically two passes should be done for a lot of languages (they define equivalence classes for the accented characters in the first pass, then sort on the accents in the second), but in practice you get the right result when you do one pass and then sort on (sortkey, page_title) in the queries. Still, there are a few languages where you need multiple passes (such as Thai, where you sort on consonants first, and only after that on vowels).
4. in some languages such as Chinese it is impossible to sort correctly without a dictionary.
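To make the first two categories concrete, here is a rough Python sketch (not MediaWiki code). The function names and the EXAMPLE_RULES table are made up for illustration; the table just reuses the three example rules from point 2 and is not a complete rule set for any real language.

```python
import unicodedata


def strip_key(title: str) -> str:
    """Type 1: lowercase, decompose, then drop combining marks,
    separators (category Z*) and punctuation (category P*)."""
    decomposed = unicodedata.normalize("NFD", title.lower())
    return "".join(
        ch
        for ch in decomposed
        if not unicodedata.combining(ch)
        and unicodedata.category(ch)[0] not in ("Z", "P")
    )


# Hypothetical translation table, copied from the examples in point 2.
EXAMPLE_RULES = {"ö": "o", "ő": "o~", "ß": "ss"}


def table_key(title: str, rules: dict) -> str:
    """Type 2: apply per-language replacement rules first, then fall
    back to the generic type 1 stripping for everything else."""
    # NFC so that precomposed letters such as "ő" match the table keys.
    composed = unicodedata.normalize("NFC", title.lower())
    replaced = "".join(rules.get(ch, ch) for ch in composed)
    return strip_key(replaced)
```

With a rule like ő => o~, sorting on table_key puts ő after every plain o, because "~" compares higher than any ASCII letter. That only holds for a plain codepoint comparison of the keys, which is an assumption about how the sortkey column is compared.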
Coding the first or second type of collation rule seems relatively simple, and already a huge gain. (Also, RFC 3454 might be worth checking out as it has language-independent rules for more than diacritics.)
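For the third category, a two-pass key and the cheaper "one pass plus (sortkey, page_title)" variant could be sketched like this in Python. All names are hypothetical, the NUL separator is an assumption (it only works if it sorts below every character that can appear in a key), and the accent pass is deliberately crude: it records only whether a letter carries a diacritic, not which one.

```python
import unicodedata

SEPARATOR = "\x00"  # assumption: compares lower than any key character


def base_pass(title: str) -> str:
    """First pass: map each accented letter to its equivalence class
    by dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", title.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_pass(title: str) -> str:
    """Second pass: one digit per base character, "1" if it carries
    at least one combining mark and "0" otherwise."""
    weights = []
    for ch in unicodedata.normalize("NFD", title.lower()):
        if unicodedata.combining(ch):
            if weights:
                weights[-1] = "1"
        else:
            weights.append("0")
    return "".join(weights)


def two_pass_key(title: str) -> str:
    """Concatenate both passes so accents only break ties."""
    return base_pass(title) + SEPARATOR + accent_pass(title)


def title_tiebreak_key(title: str) -> tuple:
    """One pass only; the title itself breaks ties, mirroring a sort
    on (sortkey, page_title) in the query."""
    return (base_pass(title), title)
```

Either key can be passed to sorted(). The two differ only in how they order titles whose first-pass keys are equal.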
The only serious problem with it is that if the rules for automatic default sorting changed, a script of some sort would probably have to reparse all pages in some cases to figure out the original sort key provided, which would be kind of expensive.
You can have a separate raw_sortkey column if that's a large concern. Anyway, this is the same for any solution that does not rely on MySQL collation: when the rules change, you need to update the relevant column in the database.
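A toy version of the raw_sortkey idea, using Python and an in-memory SQLite database purely for illustration. The table name, the column layout and the function names are all invented here and are not the real MediaWiki schema. The point is only that the derived key can be rebuilt from the stored raw key without reparsing any page.

```python
import sqlite3
import unicodedata


def strip_diacritics(raw: str) -> str:
    """Example of a "new" rule: lowercase and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", raw.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def create_demo_table(conn: sqlite3.Connection) -> None:
    # Hypothetical layout: raw_sortkey is what the page provided,
    # sortkey is derived from it by the current rules.
    conn.execute(
        "CREATE TABLE category_sort_demo "
        "(page_title TEXT, raw_sortkey TEXT, sortkey TEXT)"
    )


def add_link(conn, page_title, raw_sortkey, key_func) -> None:
    conn.execute(
        "INSERT INTO category_sort_demo VALUES (?, ?, ?)",
        (page_title, raw_sortkey, key_func(raw_sortkey)),
    )


def rebuild_sortkeys(conn, key_func) -> None:
    """Recompute every derived key from the stored raw key."""
    rows = conn.execute(
        "SELECT rowid, raw_sortkey FROM category_sort_demo"
    ).fetchall()
    conn.executemany(
        "UPDATE category_sort_demo SET sortkey = ? WHERE rowid = ?",
        [(key_func(raw), rowid) for rowid, raw in rows],
    )
    conn.commit()


def sorted_titles(conn) -> list:
    rows = conn.execute(
        "SELECT page_title FROM category_sort_demo "
        "ORDER BY sortkey, page_title"
    )
    return [title for (title,) in rows]
```

On a real wiki the rebuild would have to run in batches, but it is still one pass over a table rather than a reparse of every page.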
What are the chances that we get decent MySQL collation in the near future (say, the next few years)? Bug 164 was opened 5 years ago; there is no point in waiting another 5 years for database-level collations (and if we do get them, the system proposed in this thread can be removed without any complication). Waiting forever will only result in people implementing the same solution with DEFAULTSORT, either by hand (a huge waste of resources) or with bots (even more expensive than a built-in algorithm).