There are currently 94 WMF wikis using UCA category collation rather than the default "uppercase" collation. The Unicode Collation Algorithm (UCA) is the official standard for how to sort Unicode characters, and generally follows how a human would typically alphabetize strings. For example, uppercase collation sorts Aztec, Ärsenik, Zoo, Aardvark as "Aardvark, Aztec, Zoo, Ärsenik", but uca-default collation sorts them as "Aardvark, Ärsenik, Aztec, Zoo". UCA collation also (optionally) supports natural numeric sorting so that 100, 1, 99 sorts as "1, 99, 100" rather than "1, 100, 99". The WMF Community Tech team has recently posted proposals on English Wikipedia and several Wiktionaries asking if these communities would support switching to UCA collation. The proposal on English Wikipedia has received unanimous support so far.[1] We thought that Wiktionaries would be more skeptical of the change, but so far we have received only positive responses.[2]
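The two orderings described above can be reproduced with a small Python sketch. It is only an approximation under stated assumptions: `uppercase_key`, `approx_uca_key` and `numeric_key` are hypothetical helper names, and `approx_uca_key` merely ignores accents and case at the first level. It is not an implementation of UCA (MediaWiki relies on ICU for that) and is not MediaWiki's own code.

```python
import re
import unicodedata

# "\u00c4" is the precomposed letter Ä, spelled out so the example does not
# depend on how this file is normalized.
TITLES = ["Aztec", "\u00c4rsenik", "Zoo", "Aardvark"]
NUMBERS = ["100", "1", "99"]


def uppercase_key(title):
    """Mimic the "uppercase" collation: upper-case, then compare code points."""
    return title.upper()


def approx_uca_key(title):
    """Rough stand-in for uca-default (assumption, not real UCA):
    strip accents and case first, use the original string as a tie-breaker."""
    decomposed = unicodedata.normalize("NFD", title)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), title)


def numeric_key(title):
    """Natural numeric ordering: runs of digits compare as integers."""
    parts = re.split(r"(\d+)", title)
    # With a capturing group, odd positions are always the digit runs,
    # so strings and integers never meet at the same position.
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]


if __name__ == "__main__":
    print(sorted(TITLES, key=uppercase_key))   # Aardvark, Aztec, Zoo, Ärsenik
    print(sorted(TITLES, key=approx_uca_key))  # Aardvark, Ärsenik, Aztec, Zoo
    print(sorted(NUMBERS))                     # 1, 100, 99
    print(sorted(NUMBERS, key=numeric_key))    # 1, 99, 100
```

The accent-stripping shortcut happens to match uca-default for this list, but it would give wrong results for languages where an accented letter is a letter in its own right.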
Since it seems that most wikis are receptive to switching to UCA, maybe we should just make it the default rather than waiting on all the wikis to request it separately. Of the large Wikipedias, French, Dutch, Polish, Portuguese, and Russian are already using UCA, and German is in the process of switching.[3] For non-Latin scripts, my understanding is that UCA will be a big improvement, especially if we switch them to language-specific implementations, like uca-ja, uca-zh, uca-ar, etc.
Three questions:
1. Does switching the default collation from "uppercase" to "uca-default" sound like a good idea?
2. Should this be proposed on meta or is it too technical?
3. Are there any wikis that would need to opt out of this for some reason? (I know there are issues with Kurdish,[4] but that's the only one I know about.)
1. https://en.wikipedia.org/wiki/Wikipedia_talk:Categorization#OK_to_switch_Eng...
2. https://phabricator.wikimedia.org/T128502
3. https://phabricator.wikimedia.org/T128806
4. https://phabricator.wikimedia.org/T48235
Yes, of course, & a meta discussion will likely unearth many reasons to opt-out ;)
Does UCA (or an extension) do the right thing for West Frisian (fy) with respect to y and i?
Or, ... it would be helpful to put the list of 94 wikis somewhere easy to consume.
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
John Mark Vandenberg wrote:
Yes, of course, & a meta discussion will likely unearth many reasons to opt-out ;)
Does UCA (or an extension) do the right thing for West Frisian (fy) with respect to y and i?
Or, ... it would be helpful to put the list of 94 wikis somewhere easy to consume.
My count of wikis not using "uppercase" is a bit lower: 89. I started this page: https://meta.wikimedia.org/wiki/Collation.
I agree with having a discussion on Meta-Wiki. I think this type of larger undertaking also requires close coordination with a database administrator, if it requires running maintenance/updateCollation.php.
MZMcBride
On Friday, May 27, 2016, Ryan Kaldari rkaldari@wikimedia.org wrote:
Does switching the default collation from "uppercase" to "uca-default" sound like a good idea?
I think we should start with the ones that actually have locales in the ICU project. While uca-default may well be a better fallback for other languages, starting with the ones that upstream has specifically checked as being a good match sounds like a less controversial first step.
For numeric sorting, I'd suggest it actually be deployed somewhere first (not to mention actually written), in case there are unexpected issues, before talking about deploying it everywhere.
-- Bawolff
On Friday, May 27, 2016, John Mark Vandenberg jayvdb@gmail.com wrote:
Does UCA (or an extension) do the right thing for West Frisian (fy) with respect to y and i?
Fy is not on the list at https://ssl.icu-project.org/trac/browser/icu/trunk/source/data/coll?order=na... . As a general rule, any language not on that list that does something that either conflicts with English or is complicated (such as treating letters with diacritics as full letters to be sorted separately; in UCA terminology, having a primary weight difference) will probably not work fully correctly with the UCA collation. Of course, uca-default still might be a better fallback than the current system, depending on the language.
-- bawolff
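The "primary weight difference" point can be illustrated with Swedish, where å, ä and ö are separate letters that sort after z. The sketch below is only an illustration under stated assumptions: `untailored_key` and `tailored_key` are hypothetical names, the alphabet table is hand-written, and nothing here reflects how ICU actually implements tailorings.

```python
import unicodedata

# Swedish alphabet: the accented letters are full letters placed after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
RANK = {letter: position for position, letter in enumerate(SWEDISH_ALPHABET)}

WORDS = ["zebra", "\u00e4pple", "apa"]


def untailored_key(word):
    """Treat accents as a minor difference, so ä sorts together with a."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tailored_key(word):
    """Give å, ä and ö their own rank after z (a primary difference)."""
    composed = unicodedata.normalize("NFC", word.casefold())
    # Characters outside the table all share one rank after ö; good enough
    # for a sketch, not for real data.
    return [RANK.get(ch, len(RANK)) for ch in composed]


if __name__ == "__main__":
    print(sorted(WORDS, key=untailored_key))  # apa, äpple, zebra
    print(sorted(WORDS, key=tailored_key))    # apa, zebra, äpple
```

A language with rules like these but no tailoring of its own would get the first ordering from a generic collation, which is the kind of mismatch described above.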
OK, so it sounds like the best way forward is:
1. For any wikis that have specific language versions of uca collation available (for example, uca-fr), but haven't yet switched to it, go ahead and switch them to that collation. This should take care of a few dozen wikis.
2. Start a discussion on Meta wiki about potentially changing the default collation from uppercase to uca-default and (hopefully) find out if this would cause any problems or if there are wikis that would want to opt out (or if it's just a bad idea in general).
Does that sound reasonable to everyone?
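Step 1 of the plan above could be sketched roughly as follows. Everything in the sketch is an assumption made for illustration: `proposed_collation` is a hypothetical helper, and `TAILORED_LANGUAGES` only lists the language-specific collations named in this thread, not the real set of available ones.

```python
# Illustrative only: languages for which a "uca-<code>" collation was
# mentioned in this thread. The real list would have to come from upstream.
TAILORED_LANGUAGES = {"fr", "ja", "zh", "ar"}


def proposed_collation(language_code, current_collation):
    """Return the collation a wiki would end up with after step 1.

    Wikis whose language has a specific UCA collation are moved to it;
    every other wiki keeps its current setting until the Meta discussion
    (step 2) has taken place.
    """
    if language_code in TAILORED_LANGUAGES:
        return "uca-" + language_code
    return current_collation


if __name__ == "__main__":
    # Hypothetical inputs: (language code, current collation).
    for code, current in [("ja", "uppercase"), ("fy", "uppercase"), ("fr", "uca-fr")]:
        print(code, current, "->", proposed_collation(code, current))
```

Keeping the decision in one small function makes it easy to review the resulting list of wikis before any collation is actually changed.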
Yes, that sounds reasonable :-) Purodha
Meanwhile, it.wiki enthusiastically joined the UCA wikis.
Vito
MZMcBride wrote:
I agree with having a discussion on Meta-Wiki. I think this type of larger undertaking also requires close coordination with a database administrator, if it requires running maintenance/updateCollation.php.
A (the?) DBA has been closely following and helping the team doing the latest changes to the script, making sure it executes faster and with non-impacting load. He is ok with the current implementation and happy to let it run (one wiki at a time).
I'd suggest a different scheme.
Create a set of pages (at Meta) explaining the pros, set a deadline for the change, and use a notice or global message delivery to ask wikis to opt out if they disagree.
Vito
On 03/06/2016 12:12, Jaime Crespo wrote:
A (the?) DBA has been closely following and helping the team doing the latest changes to the script, making sure it executes faster and with non-impacting load. He is ok with the current implementation and happy to let it run (one wiki at a time).