Aryeh Gregor <Simetrical+wikilist <at> gmail.com> writes:
On Thu, May 14, 2009 at 10:34 AM, Marcus Buck <wiki <at> marcusbuck.org> wrote:
Take the pagename and make it uppercase (could be lowercase too, but
uppercase seems better as the first letter will show up in the
category). str_replace "Ä" with "A", "Ö" with "O", "Ü" with "U" and "ß"
with "SS". Also str_replace other Latin characters with diacritics with
their counterpart without diacritic. And that's our sortkey. This very
simple procedure should reduce the number of necessary defaultsorts
(except for articles about persons) by about 90% in the German wikipedia.
This would absolutely be possible as a "mostly works" solution for
category sorting. It would mostly just need to have the appropriate
code written.
Most of that can be done with one single language-independent algorithm. All the
collation rules I've seen until now fall into one of four categories:
1. you just need to transform to lowercase and discard diacritics (and space,
punctuation etc.). That is, you can do a Unicode decomposition and then throw
out everything from the combining ranges. I think there are very few languages
where that would be fully correct (for example it doesn't handle the German ß),
but it would make sorting a lot less wrong, at least for Latin scripts. (For
example, orr < őr < ott is incorrect in Hungarian as ő should be sorted after o,
but it is still a lot better than putting ő way down after z.) And it only has
to be written once.
2. you need a translation table with string replacement rules like ö => o, ő =>
o~, ß => ss. Works for most languages with Latin letters and probably a lot of
others. Needs per-language rules, but it is much easier to ask language
communities to provide translation rules than to ask them to write sorting code
(and then review it). Most wikis probably already developed those rules and use
them with DEFAULTSORT.
3. you need a multipass replacement with multiple translation tables (and you
concatenate the result using some sort of separator character). Theoretically
two passes should be done for a lot of languages (they define equivalence
classes for the accented characters in the first pass, then sort on the accents
in the second), but in practice you get the right result when you do one pass
and then sort on (sortkey, page_title) in the queries. Still, there are a few
languages where you need multiple passes (such as Thai, where you sort on
consonants first, and only after that on vowels).
4. in some languages such as Chinese it is impossible to sort correctly without
a dictionary.
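
To make the third category more concrete, here is a rough Python sketch of a two-pass key. The function names and the NUL separator are my own choices for illustration, and the two passes shown (base letters, then the decomposed form with accents) are only a stand-in for real per-language tables. It also shows the shortcut mentioned above: one pass plus sorting on (sortkey, page_title).

```python
import unicodedata

SEPARATOR = "\x00"  # assumed separator; sorts below every letter


def base_letters(title):
    """First pass: lowercase and drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", title.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def with_accents(title):
    """Second pass: keep the accents so they can break ties."""
    return unicodedata.normalize("NFD", title.lower())


def multipass_key(title, passes=(base_letters, with_accents)):
    """Concatenate the result of every pass, separated by SEPARATOR."""
    return SEPARATOR.join(p(title) for p in passes)


titles = ["őr", "orr", "or"]

# Explicit two-pass key.
two_pass = sorted(titles, key=multipass_key)

# One pass, with the page title itself as the tie-breaker.
one_pass_plus_title = sorted(titles, key=lambda t: (base_letters(t), t))
```

For this small input both variants give the same order, which is the point made above: the second pass only matters when the first-pass keys are equal.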
Coding the first or second type of collation rule seems relatively simple, and
already a huge gain. (Also, RFC 3454 might be worth checking out as it has
language-independent rules for more than diacritics.)
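
To show how little code the first two types need, here is a minimal Python sketch. It is not MediaWiki code; the function names and the two rule tables are hypothetical, and the table entries are just the examples from this mail (ß => ss, ő => o~).

```python
import unicodedata

# Hypothetical per-language tables, filled only with examples from this thread.
GERMAN_RULES = {"ß": "ss"}
HUNGARIAN_RULES = {"ő": "o~"}


def strip_diacritics_key(title: str) -> str:
    """Type 1: lowercase, decompose, drop combining marks, spaces, punctuation."""
    decomposed = unicodedata.normalize("NFD", title.lower())
    return "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) and ch.isalnum()
    )


def table_key(title: str, rules: dict) -> str:
    """Type 2: use the per-language table where it has an entry,
    and fall back to the type 1 rule for every other character."""
    out = []
    for ch in unicodedata.normalize("NFC", title.lower()):
        out.append(rules[ch] if ch in rules else strip_diacritics_key(ch))
    return "".join(out)


hungarian_order = sorted(
    ["ott", "őr", "orr"],
    key=lambda t: table_key(t, HUNGARIAN_RULES),
)
```

With the ő => o~ rule, `hungarian_order` puts őr after both orr and ott, because "~" sorts after the ASCII letters. A language community would only have to supply the table, not the code.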
The only serious problem with it is that if the rules
for automatic default sorting changed, a script of some sort would
probably have to reparse all pages in some cases to figure out the
original sort key provided, which would be kind of expensive.
You can have a separate raw_sortkey column if that's a large concern. Anyway,
this is the same for any solution that does not rely on MySQL collation: when
the rules change, you need to update the relevant column in the database.
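
A rough sketch of the raw_sortkey idea, using Python's sqlite3 as a stand-in for MySQL. The table name category_entry and its layout are made up for the example and are not the real MediaWiki schema; only the raw_sortkey column name comes from the suggestion above.

```python
import sqlite3
import unicodedata


def strip_diacritics_key(text):
    """Lowercase and drop combining marks (the 'type 1' rule)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def recompute_sortkeys(conn, key_func):
    """Rebuild the derived sortkey column from raw_sortkey; no page reparse."""
    rows = conn.execute(
        "SELECT rowid, raw_sortkey FROM category_entry"
    ).fetchall()
    conn.executemany(
        "UPDATE category_entry SET sortkey = ? WHERE rowid = ?",
        [(key_func(raw), rowid) for rowid, raw in rows],
    )
    conn.commit()


def listing(conn):
    """Category listing, with page_title as the tie-breaker."""
    query = "SELECT page_title FROM category_entry ORDER BY sortkey, page_title"
    return [row[0] for row in conn.execute(query)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category_entry (page_title TEXT, raw_sortkey TEXT, sortkey TEXT)"
)
conn.executemany(
    "INSERT INTO category_entry (page_title, raw_sortkey) VALUES (?, ?)",
    [(t, t) for t in ["Zürich", "Ärger", "Berlin"]],
)

recompute_sortkeys(conn, strip_diacritics_key)
with_rules = listing(conn)

# Simulate a rule change: plain uppercasing, i.e. no diacritic handling.
recompute_sortkeys(conn, str.upper)
without_rules = listing(conn)
```

When the rules change, only `recompute_sortkeys` has to run over the table. In this sketch `with_rules` lists Ärger first, while `without_rules` pushes it after Zürich, since the default binary comparison places "Ä" after "Z".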
What are the chances that we get decent MySQL collation in the near future
(say, the next few years)? Bug 164 was opened 5 years ago; there is no point in
waiting another 5 years for database-level collations (and if we do get them, the
system proposed in this thread can be removed without any complication). Waiting
forever will only result in people implementing the same solution with
DEFAULTSORT, either by hand (huge waste of resources) or with bots (even more
expensive than a built-in algorithm).