On Friday, May 27, 2016, Ryan Kaldari <rkaldari(a)wikimedia.org> wrote:
There are currently 94 WMF wikis using UCA category collation rather than
the default "uppercase" collation. The Unicode Collation Algorithm (UCA) is
the official standard for how to sort Unicode characters, and generally
follows how a human would typically alphabetize strings. For example,
uppercase collation sorts Aztec, Ärsenik, Zoo, Aardvark as "Aardvark,
Aztec, Zoo, Ärsenik", but uca-default collation sorts them as "Aardvark,
Ärsenik, Aztec, Zoo". UCA collation also (optionally) supports natural
numeric sorting so that 100, 1, 99 sorts as "1, 99, 100" rather than "1,
100, 99". The WMF Community Tech team has recently posted proposals on
English Wikipedia and several Wiktionaries asking if these communities
would support switching to UCA collation. The proposal on English Wikipedia
has received unanimous support so far.[1] We thought that Wiktionaries
would be more skeptical of the change, but so far we have received only
positive responses.[2]
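
To make the two orderings above concrete, here is a rough sketch in Python. It is not the MediaWiki or ICU implementation: the "uca-like" key only approximates UCA by stripping combining marks, and all function names are made up for illustration. Real UCA collation needs the full ICU tables.

```python
import re
import unicodedata

def uppercase_key(title):
    # "uppercase" collation as described above: compare uppercased titles
    # by code point, so "Ä" (U+00C4) lands after "Z".
    return title.upper()

def approx_uca_key(title):
    # Assumption: a crude stand-in for uca-default. Decompose, drop
    # combining marks, and compare base letters case-insensitively.
    decomposed = unicodedata.normalize("NFD", title)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), title)

def numeric_key(title):
    # Natural numeric sorting: runs of digits compare as integers.
    parts = [p for p in re.split(r"(\d+)", title) if p]
    return [(0, int(p), "") if p.isdigit() else (1, 0, p.casefold())
            for p in parts]

words = ["Aztec", "Ärsenik", "Zoo", "Aardvark"]
by_uppercase = sorted(words, key=uppercase_key)
by_uca_like = sorted(words, key=approx_uca_key)

numbers = ["100", "1", "99"]
as_strings = sorted(numbers)
as_numbers = sorted(numbers, key=numeric_key)
```

With these inputs, by_uppercase and by_uca_like reproduce the two orderings quoted above, and as_strings versus as_numbers shows the "1, 100, 99" versus "1, 99, 100" difference.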
Since it seems that most wikis are receptive to switching to UCA, maybe we
should just make it the default rather than waiting on all the wikis to
request it separately. Of the large Wikipedias, French, Dutch, Polish,
Portuguese, and Russian are already using UCA, and German is in the process
of switching.[3] For non-Latin scripts, my understanding is that UCA will
be a big improvement, especially if we switch them to language-specific
implementations, like uca-ja, uca-zh, uca-ar, etc.
Three questions:
1. Does switching the default collation from "uppercase" to "uca-default"
sound like a good idea?
2. Should this be proposed on meta or is it too technical?
3. Are there any wikis that would need to opt out of this for some reason?
(I know there are issues with Kurdish,[4] but that's the only one I know
about.)
1. https://en.wikipedia.org/wiki/Wikipedia_talk:Categorization#OK_to_switch_En…
I think we should start with the ones that actually have locales in the
ICU project. While uca-default is perhaps a better fallback for other
languages, starting with the ones that have been specifically checked by
upstream as being a good match sounds like a less controversial first step.
For numeric sorting, I'd suggest it actually be deployed somewhere first
(not to mention actually written), in case there are unexpected issues,
before we talk about deploying it everywhere.
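
As a rough sketch of that first step (Python, purely illustrative): the set of languages below is a placeholder built from the uca-ja, uca-zh, uca-ar examples mentioned earlier, not the actual list of ICU locales, and the function name is made up.

```python
# Hypothetical placeholder: languages assumed to have a tailored ICU
# collation locale. The real list would have to come from ICU itself.
ASSUMED_ICU_LOCALES = {"ja", "zh", "ar"}

def first_step_collation(lang, current="uppercase"):
    # Switch only wikis whose language has an upstream-checked locale;
    # everything else keeps its current collation for now.
    if lang in ASSUMED_ICU_LOCALES:
        return "uca-" + lang
    return current
```

Under that assumption, first_step_collation("ja") gives "uca-ja", while a language without a locale in the set keeps "uppercase" until a later step.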
--
Bawolff