Category sorting and first letters - Wikitech-l

17 Jan 2011


In r80443 I added a feature allowing categories to be sorted using the
Unicode Collation Algorithm (UCA). I wanted to briefly talk about the
potential user impact, the design choices and the caveats.

Sorting was the easy part. The hard part was providing a "first
letter" concept which would be reasonably sane. The idea I came up
with was to compile a list of first letters, themselves sorted using
the UCA. Then the "first letter" of a given string is the nearest
letter in the list which sorts at or before the string.

For instance, if you have the letters A, B and C and the string
Aardvark, sorting them together gives:

A
Aardvark
B
C

So we know that A is the first letter of Aardvark because Aardvark
sorts immediately below A. This algorithm gives us a number of nice
properties:
* It automatically drops accents, since accented letters sort the same
as unaccented letters (at the primary level). Same with case
differences, hiragana/katakana, etc.
* You can work out the initial Jamo of a Hangul syllable character by
just omitting the composed syllables from the "first letter" list.
Previously this was done with a special-case hack in
Language::firstChar().
* Vowel reordering in Thai and Lao is automatically supported.
So "แก" sorts under heading "ก" and "แข" sorts under heading "ข".
* The collation can be expanded to support all sorts of other crazy
features, and the first letter feature will keep working in a sane
way. For instance, you could have an English collation which removed
"the" from the start of a title.
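The lookup described above can be sketched in Python as a binary search over the sorted header keys. This is only an illustration, not the code from r80443: `toy_primary_key` is a made-up stand-in for a real UCA primary-level sort key (it just strips combining marks and folds case), and `english_key`, `build_headers` and `first_letter` are hypothetical names.

```python
import bisect
import unicodedata


def toy_primary_key(s):
    """Crude stand-in for a UCA primary-level key (an assumption, not real UCA)."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def english_key(s):
    """Hypothetical English tailoring: ignore a leading "the "."""
    key = toy_primary_key(s)
    return key[4:] if key.startswith("the ") else key


def build_headers(letters, key=toy_primary_key):
    """Return (sorted keys, letters in the same order) for the header list."""
    pairs = sorted((key(letter), letter) for letter in letters)
    return [k for k, _ in pairs], [letter for _, letter in pairs]


def first_letter(title, header_keys, header_letters, key=toy_primary_key):
    """Nearest header that sorts at or before the title, or None if there is none."""
    i = bisect.bisect_right(header_keys, key(title))
    return header_letters[i - 1] if i else None


keys, letters = build_headers(["A", "B", "C", "E"])
print(first_letter("Aardvark", keys, letters))                      # A
print(first_letter("Éclair", keys, letters))                        # E
print(first_letter("The Beatles", keys, letters, key=english_key))  # B
```

Because the header choice depends only on the sort key, swapping in a different key function (as with `english_key` here) changes the headings without touching the lookup itself.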

I compiled a list of 14,742 suitable header characters, identified by
processing various Unicode data files. That list probably still needs
lots of tweaks.

There is a downside to this scheme. The default UCA table gives all
characters with a similar logical function to the digits 0-9 the same
primary sort order as the corresponding ASCII digits. So a page like
[[१९२०]] on the Bihari Wikipedia will sort under a heading of "1"
instead of "१". There may be other instances of accidental cultural
imperialism. However, this can be fixed by compiling
language-dependent lists of header characters.
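The digit problem and the proposed fix can be illustrated with a small Python sketch. Here `toy_digit_key` is an assumed stand-in for the UCA behaviour just described (it maps every Unicode decimal digit to its ASCII equivalent), and `heading_for` and the two header lists are hypothetical.

```python
import bisect
import unicodedata


def toy_digit_key(s):
    """Map each Unicode decimal digit to its ASCII digit (stand-in for UCA primary weights)."""
    out = []
    for c in s:
        value = unicodedata.decimal(c, None)
        out.append(c if value is None else str(value))
    return "".join(out)


def heading_for(title, headers):
    """Nearest header that sorts at or before the title under toy_digit_key."""
    pairs = sorted((toy_digit_key(h), h) for h in headers)
    keys = [k for k, _ in pairs]
    i = bisect.bisect_right(keys, toy_digit_key(title))
    return pairs[i - 1][1] if i else None


ASCII_DIGITS = [str(d) for d in range(10)]
# Hypothetical language-specific list: Devanagari digits U+0966..U+096F.
DEVANAGARI_DIGITS = [chr(0x0966 + d) for d in range(10)]

title = "\u0967\u096F\u0968\u0966"  # the page title १९२०
print(heading_for(title, ASCII_DIGITS))       # 1
print(heading_for(title, DEVANAGARI_DIGITS))  # १
```

The sort key is identical in both calls; only the list of candidate headers changes, which is why a per-language header list is enough to fix the heading.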

The UCA default table is not meant to sort any language correctly;
it's just a compromise collation. Support for language-specific
collations can easily be added. Whether we get language-specific
collations or not, I'd like to think about enabling this feature on
Wikimedia.

The most glaring omission from the UCA default tables is sensible
sorting of the unified Han characters.

In a Chinese context, there's an obvious way to sort characters, and
that's by their order in the KangXi dictionary. The Unihan database
gives such an ordering, and code points within each block follow it.
But the blocks are not ordered that way relative to each other. So if
you sort by code point, all the Han characters that aren't in the
U+4E00 to U+9FFF block will sort incorrectly. That's what the default
UCA does, with a few minor exceptions.
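A tiny Python check shows the cross-block problem. Python compares strings by code point, so a character from CJK Extension A (starting at U+3400) sorts ahead of everything in the main U+4E00 to U+9FFF block, and a character from Extension B (starting at U+20000) sorts after all of it, whatever their dictionary positions are. The three characters below are chosen only because each is the first code point of its block.

```python
# First code points of the main block, Extension B and Extension A, in that order.
chars = ["\u4e00", "\U00020000", "\u3400"]

# Plain string sorting in Python is code point order.
by_code_point = sorted(chars)

print([f"U+{ord(c):04X}" for c in by_code_point])
# ['U+3400', 'U+4E00', 'U+20000']
```

A dictionary-order collation would need a key derived from Unihan data (for example radical and stroke count) instead of the raw code point; that data is not reproduced here.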

In a Japanese context, the way to sort ideographic characters is to
convert them to phonetic hiragana and then to sort the resulting
string. I don't know if there is any free software for doing this. On
the Japanese Wikipedia, they achieve the same result by manually
setting the sort key of every page to be the hiragana version of the
title.

There's lots of room here for other people to get involved, especially
if you know a language other than English.

-- Tim Starling