There are currently 94 WMF wikis using UCA category collation rather than the default "uppercase" collation. The Unicode Collation Algorithm (UCA) is the official standard for sorting Unicode characters, and it generally matches how a human would alphabetize strings. For example, uppercase collation sorts Aztec, Ärsenik, Zoo, Aardvark as "Aardvark, Aztec, Zoo, Ärsenik", but uca-default collation sorts them as "Aardvark, Ärsenik, Aztec, Zoo". UCA collation also (optionally) supports natural numeric sorting, so that 100, 1, 99 sorts as "1, 99, 100" rather than "1, 100, 99".

The WMF Community Tech team has recently posted proposals on English Wikipedia and several Wiktionaries asking whether these communities would support switching to UCA collation. The proposal on English Wikipedia has received unanimous support so far.[1] We thought that Wiktionaries would be more skeptical of the change, but so far we have received only positive responses.[2]
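To make the sorting difference concrete, here is a rough Python sketch that reproduces the two examples above. It is only an approximation under stated assumptions: the function names are hypothetical, "uppercase" collation is modelled as uppercasing followed by code point comparison, and UCA is imitated by stripping accents before comparing. A real UCA implementation (for example ICU) uses full collation tables with several comparison levels, so this should not be read as how MediaWiki does it.

```python
import re
import unicodedata


def uppercase_key(title):
    # Assumption: "uppercase" collation behaves like uppercasing the
    # title and then comparing by Unicode code point.
    return title.upper()


def approx_uca_key(title):
    # Rough stand-in for UCA, NOT the real algorithm: compare base
    # letters first (accents stripped, case ignored), then fall back
    # to the original string to break ties deterministically.
    decomposed = unicodedata.normalize("NFD", title)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), title)


def numeric_key(title):
    # Natural numeric sorting: split into digit and non-digit runs and
    # compare digit runs as integers. With one capturing group,
    # re.split puts the digit runs at the odd indices.
    key = []
    for i, part in enumerate(re.split(r"([0-9]+)", title)):
        if not part:
            continue
        if i % 2 == 1:
            key.append((0, int(part), ""))
        else:
            key.append((1, 0, part.casefold()))
    return key


words = ["Aztec", "Ärsenik", "Zoo", "Aardvark"]
by_uppercase = sorted(words, key=uppercase_key)
by_approx_uca = sorted(words, key=approx_uca_key)

numbers = ["100", "1", "99"]
by_string = sorted(numbers)
by_numeric = sorted(numbers, key=numeric_key)

print(by_uppercase)   # Ä (U+00C4) has a higher code point than Z
print(by_approx_uca)  # Ä is treated like A at the first level
print(by_string)
print(by_numeric)
```

The accent-stripping shortcut happens to work for these Latin-script examples, but it says nothing useful about non-Latin scripts or about language-specific tailorings.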
Since most wikis seem receptive to switching to UCA, maybe we should just make it the default rather than waiting for each wiki to request it separately. Of the large Wikipedias, French, Dutch, Polish, Portuguese, and Russian are already using UCA, and German is in the process of switching.[3] For non-Latin scripts, my understanding is that UCA will be a big improvement, especially if we switch them to language-specific implementations, like uca-ja, uca-zh, uca-ar, etc.
Three questions:
1. Does switching the default collation from "uppercase" to "uca-default" sound like a good idea?
2. Should this be proposed on meta or is it too technical?
3. Are there any wikis that would need to opt out of this for some reason? (I know there are issues with Kurdish,[4] but that's the only one I know about.)
[1] https://en.wikipedia.org/wiki/Wikipedia_talk:Categorization#OK_to_switch_Eng...
[2] https://phabricator.wikimedia.org/T128502
[3] https://phabricator.wikimedia.org/T128806
[4] https://phabricator.wikimedia.org/T48235