In r80443 I added a feature allowing categories to be sorted using the Unicode Collation Algorithm (UCA). I wanted to briefly talk about the potential user impact, the design choices and the caveats.
Sorting was the easy part. The hard part was providing a "first letter" concept that would be reasonably sane. The idea I came up with was to compile a list of first letters, themselves sorted using the UCA. The "first letter" of a given string is then the nearest letter in the list that sorts above the string, that is, at or before it in the sorted order.
For instance, if you have the letters A, B and C and the string Aardvark, sorting them together gives:
A Aardvark B C
So we know that A is the first letter of Aardvark because Aardvark sorts immediately below A. This algorithm gives us a number of nice properties:
* It automatically drops accents, since accented letters sort the same as unaccented letters (at the primary level). Same with case differences, hiragana/katakana, etc.
* You can work out the initial Jamo of a Hangul syllable character by just omitting the composed syllables from the "first letter" list. Previously this was done with a special-case hack in Language::firstChar().
* Vowel reordering in Thai and Lao is automatically supported. So "แก" sorts under heading "ก" and "แข" sorts under heading "ข".
* The collation can be expanded to support all sorts of other crazy features, and the first letter feature will keep working in a sane way. For instance, you could have an English collation which removed "the" from the start of a title.
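The lookup described above can be sketched roughly as follows. This is only an illustration, written in Python so that it runs without ICU: `primary_key` is a toy stand-in for a real UCA primary-level sort key (it merely strips combining accents and case), and `build_headers` and `first_letter` are hypothetical names, not functions from MediaWiki.

```python
import bisect
import unicodedata


def primary_key(s):
    """Toy stand-in for a UCA primary-level sort key (assumption):
    decompose, drop combining marks, and ignore case."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def build_headers(letters):
    """Sort the candidate header letters by their key, once."""
    pairs = sorted((primary_key(letter), letter) for letter in letters)
    keys = [key for key, _ in pairs]
    labels = [label for _, label in pairs]
    return keys, labels


def first_letter(title, keys, labels):
    """Return the nearest header that sorts at or before the title,
    or None if the title sorts before every header."""
    i = bisect.bisect_right(keys, primary_key(title))
    return labels[i - 1] if i else None


keys, labels = build_headers(["A", "B", "C", "E"])
for title in ["Aardvark", "banana", "Édouard"]:
    print(title, "->", first_letter(title, keys, labels))
```

Because the headers are sorted once, each lookup is a binary search. Swapping `primary_key` for a real collation key would leave the rest of the sketch unchanged.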
I compiled a list of 14,742 suitable header characters, identified by processing various Unicode data files. That list probably still needs lots of tweaks.
There is a downside to this scheme. The default UCA table gives all characters with a similar logical function to the digits 0-9 the same primary sort order as the corresponding ASCII digits. So a page like [[१९२०]] on the Bihari Wikipedia will sort under a heading of "1" instead of "१". There may be other instances of accidental cultural imperialism. However, this can be fixed by compiling language-dependent lists of header characters.
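To illustrate the digit problem and the proposed fix, here is a rough Python sketch. It assumes a toy key function that maps every decimal digit to its ASCII equivalent, which only imitates the primary-level behaviour described above. `toy_primary_key`, `header_for` and the two header lists are hypothetical names made up for this example.

```python
import bisect
import unicodedata


def toy_primary_key(s):
    """Toy imitation of a primary-level key (assumption): every decimal
    digit gets the same key as its ASCII counterpart, and case is ignored."""
    out = []
    for ch in s:
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch.casefold())
    return "".join(out)


def header_for(title, headers):
    """Nearest header that sorts at or before the title, else None."""
    pairs = sorted((toy_primary_key(h), h) for h in headers)
    keys = [key for key, _ in pairs]
    i = bisect.bisect_right(keys, toy_primary_key(title))
    return pairs[i - 1][1] if i else None


# Language-independent list: only ASCII digits are offered as headers.
DEFAULT_HEADERS = list("0123456789")
# Language-dependent list: Devanagari digits U+0966..U+096F instead.
DEVANAGARI_DIGIT_HEADERS = [chr(0x0966 + n) for n in range(10)]

title = "१९२०"
print(header_for(title, DEFAULT_HEADERS))           # ASCII "1"
print(header_for(title, DEVANAGARI_DIGIT_HEADERS))  # Devanagari "१"
```

The sort order is identical in both cases. Only the label shown to the reader changes, which is why a per-language header list is enough to fix it.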
The UCA default table is not meant to sort any language correctly; it's just a compromise collation. Support for language-specific collations can easily be added. Whether we get language-specific collations or not, I'd like to think about enabling this feature on Wikimedia.
The most glaring omission from the UCA default tables is sensible sorting of the unified Han.
In a Chinese context, there's an obvious way to sort characters, and that's by their order in the KangXi dictionary. The Unihan database gives such an ordering, and it's used within code blocks. But it's not used between code blocks. So if you sort by code point, all the Han characters that aren't in the U+4E00 to U+9FFF block will sort incorrectly. That's what the default UCA does, with a few minor exceptions.
In a Japanese context, the way to sort ideographic characters is to convert them to phonetic hiragana and then to sort the resulting string. I don't know if there is any free software for doing this. On the Japanese Wikipedia, they achieve the same result by manually setting the sort key of every page to be the hiragana version of the title.
There's lots of room here for other people to get involved, especially if you know a language other than English.
-- Tim Starling
2011/1/17 Tim Starling tstarling@wikimedia.org:
> * It automatically drops accents, since accented letters sort the same as unaccented letters (at the primary level).
How locale aware is it? For example, in Swedish accented letters come at the end of the alphabet and in Lithuanian I, Į and Y are collated together as if they were one letter. There are many quirks of this kind in other languages.
And i don't know what to do when in the Lithuanian Wikipedia you sort names of places in the UK - should Islington come before or after York? (But hey, there's at least one Lithuanian MediaWiki developer, so i don't know whether my help is really needed here.)
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com "We're living in pieces, I want to live in peace." - T. Moore
On 18/01/11 07:41, Amir E. Aharoni wrote:
> 2011/1/17 Tim Starling tstarling@wikimedia.org:
> > * It automatically drops accents, since accented letters sort the same as unaccented letters (at the primary level).
> How locale aware is it? For example, in Swedish accented letters come at the end of the alphabet and in Lithuanian I, Į and Y are collated together as if they were one letter. There are many quirks of this kind in other languages.
It's not locale-aware. As I said, it's a compromise collation. I was hoping that other people might be interested in adding support for specific locales; that's part of the reason for my post. ICU supports lots of different locales, and there is locale-specific collation data in the CLDR.
> And i don't know what to do when in the Lithuanian Wikipedia you sort names of places in the UK - should Islington come before or after York?
Before.
$collator = new Collator('lt');
print $collator->compare( 'Islington', 'York' );
-1
But more interestingly, York goes before London:
print $collator->compare( 'York', 'London' );
-1
I think attempting to do it any other way would be a lot of trouble, and not what is wanted anyway. To put the question another way: on the English Wikipedia, should Kybartai sort before Klaipėda? I would think not.
> (But hey, there's at least one Lithuanian MediaWiki developer, so i don't know whether my help is really needed here.)
If you mean Domas, I don't think this is the kind of thing he's interested in.
-- Tim Starling
2011/1/18 Tim Starling tstarling@wikimedia.org:
> On 18/01/11 07:41, Amir E. Aharoni wrote:
> > And i don't know what to do when in the Lithuanian Wikipedia you sort names of places in the UK - should Islington come before or after York?
> Before.
> $collator = new Collator('lt');
> print $collator->compare( 'Islington', 'York' );
> -1
> But more interestingly, York goes before London:
> print $collator->compare( 'York', 'London' );
> -1
'York' before 'London' makes sense in lt context, but 'York' before 'Islington' is weird, because to the best of my understanding, it's supposed to be sorted as if it was written 'Iork'.
A dictionary that i have at home puts 'ylaragis' before 'įlašeti'.
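The "as if it was written 'Iork'" expectation can be made concrete with a small Python sketch. This is a toy model only, not the CLDR Lithuanian tailoring and not what ICU does: `lt_toy_key` is a hypothetical function that strips accents and case and then treats "y" exactly like "i".

```python
import unicodedata


def base_key(s):
    """Strip combining accents and case (toy primary-level key, assumption)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(
        c for c in decomposed if not unicodedata.combining(c)
    ).casefold()


def lt_toy_key(s):
    """Toy Lithuanian-style key: I, Į and Y all collate as one letter."""
    return base_key(s).replace("y", "i")


places = ["London", "Islington", "York"]
print(sorted(places, key=lt_toy_key))

words = ["įlašeti", "ylaragis"]
print(sorted(words, key=lt_toy_key))
```

Under this toy key York sorts before Islington (because "iork" precedes "islington") and 'ylaragis' sorts before 'įlašeti', matching the dictionary order mentioned above but not the Collator result quoted earlier.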
> I think attempting to do it any other way would be a lot of trouble, and not what is wanted anyway. To put the question another way: on the English Wikipedia, should Kybartai sort before Klaipėda? I would think not.
The intuitive answer is that in en.wikipedia Kybartai should usually be after Klaipėda, although some clever sorting is desirable. Even more so for Wiktionary.
For lt.wikipedia, this is something that its editors and readers should decide.
Tim Starling (2011-01-18 02:03):
> But more interestingly, York goes before London:
> print $collator->compare( 'York', 'London' );
> -1
> I think attempting to do it any other way would be a lot of trouble, and not what is wanted anyway. To put the question another way: on the English Wikipedia, should Kybartai sort before Klaipėda? I would think not.
I've seen accent-insensitive sorting, where for example "Bańka" would be sorted as if it was "Banka", but I haven't yet seen phone-insensitive sorting, or whatever you call it. What I mean is that in Poland "rz" is pronounced the same (almost the same) as "ż", but "rz" is nowhere near "ż" when it comes to sorting. In fact it would be very counterintuitive for me (as would be 'York' < 'London'). I think it would not be helpful, especially for foreigners. I've also said that I've _seen_ accent-insensitive dictionaries, but _most_ are case sensitive, and so "ą" > "a", not "ą" = "a". Also, when it comes to the first letter, all dictionaries I know have "Ż" separate from "Z". You might see our collation as: without accent first and with accent second. This is the why we say are ABC. And it would be intuitive for to have English collation by it's ABC with Y coming just before Z.
I think the problem should only be solved for letters which are not just Latin character + accent. How to sort them in Latin (and Latin based) characters.
Regards, Nux.
Maciej Jaros (2011-01-18 15:42):
> [...] This is the why we say are ABC. And it would be intuitive for to have English collation by it's ABC with Y coming just before Z.
Sorry, sometimes I type phonetically :-). The last sentences were supposed to be:
This is the way we say our ABC. And it would be intuitive for me to have English collation by its ABC with Y coming just before Z.
Regards, Nux.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l