On Friday, May 27, 2016, Ryan Kaldari rkaldari@wikimedia.org wrote:
There are currently 94 WMF wikis using UCA category collation rather than the default "uppercase" collation. The Unicode Collation Algorithm (UCA) is the official standard for how to sort Unicode characters, and generally follows how a human would typically alphabetize strings. For example, uppercase collation sorts Aztec, Ärsenik, Zoo, Aardvark as "Aardvark, Aztec, Zoo, Ärsenik", but uca-default collation sorts them as "Aardvark, Ärsenik, Aztec, Zoo". UCA collation also (optionally) supports natural numeric sorting so that 100, 1, 99 sorts as "1, 99, 100" rather than "1, 100, 99".
The WMF Community Tech team has recently posted proposals on English Wikipedia and several Wiktionaries asking if these communities would support switching to UCA collation. The proposal on English Wikipedia has received unanimous support so far.[1] We thought that Wiktionaries would be more skeptical of the change, but so far we have received only positive responses.[2]
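To make the sorting differences described above concrete, here is a minimal Python sketch that reproduces the two example orderings and the numeric case. It is not MediaWiki's implementation and not a real UCA implementation: the function names (`uppercase_key`, `approx_uca_key`, `natural_numeric_key`) are hypothetical, and the "UCA-like" key only strips accents and case, which is an assumption that happens to be good enough for these sample words. A real deployment would rely on ICU.

```python
import re
import unicodedata


def uppercase_key(title: str) -> str:
    # Stand-in for "uppercase" collation: uppercase the string and
    # compare by raw code point. Accented letters such as U+00C4 sort
    # after every plain ASCII letter.
    return unicodedata.normalize("NFC", title).upper()


def approx_uca_key(title: str):
    # Crude approximation of UCA ordering (assumption, not the real
    # algorithm): decompose, drop combining marks, ignore case, then
    # fall back to the original string as a tie-breaker.
    decomposed = unicodedata.normalize("NFD", title)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), title)


def natural_numeric_key(title: str):
    # Natural numeric sorting: split into digit and non-digit runs and
    # compare digit runs by integer value rather than character by character.
    parts = [p for p in re.split(r"(\d+)", title) if p]
    return [(0, int(p), "") if p.isdigit() else (1, 0, p.casefold()) for p in parts]


words = ["Aztec", "Ärsenik", "Zoo", "Aardvark"]
numbers = ["100", "1", "99"]

by_uppercase = sorted(words, key=uppercase_key)
by_uca_like = sorted(words, key=approx_uca_key)
plain_numbers = sorted(numbers)
natural_numbers = sorted(numbers, key=natural_numeric_key)

print(by_uppercase)
print(by_uca_like)
print(plain_numbers)
print(natural_numbers)
```

The first two printed lists correspond to the "uppercase" and "uca-default" orderings quoted in the message, and the last two to the "1, 100, 99" versus "1, 99, 100" numeric example.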
Since it seems that most wikis are receptive to switching to UCA, maybe we should just make it the default rather than waiting on all the wikis to request it separately. Of the large Wikipedias, French, Dutch, Polish, Portuguese, and Russian are already using UCA, and German is in the process of switching.[3] For non-Latin scripts, my understanding is that UCA will be a big improvement, especially if we switch them to language-specific implementations, like uca-ja, uca-zh, uca-ar, etc.
Three questions:
1. Does switching the default collation from "uppercase" to "uca-default" sound like a good idea?
2. Should this be proposed on meta or is it too technical?
3. Are there any wikis that would need to opt out of this for some reason? (I know there are issues with Kurdish,[4] but that's the only one I know about.)
[1] https://en.wikipedia.org/wiki/Wikipedia_talk:Categorization#OK_to_switch_Eng...
[2] https://phabricator.wikimedia.org/T128502
[3] https://phabricator.wikimedia.org/T128806
[4] https://phabricator.wikimedia.org/T48235
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
I think we should start with the ones that actually have locales in the ICU project. While uca-default is perhaps a better fallback for other languages, starting with the ones that upstream has specifically checked as being a good match sounds like a less controversial first step.
For numeric sorting, I'd suggest it actually be deployed somewhere first (not to mention actually written), in case there are unexpected issues, before talking about deploying it everywhere.
-- Bawolff