On 13 Jun 2017, at 07:09, Amir E. Aharoni
<amir.aharoni(a)mail.huji.ac.il> wrote:
Hi,
Another edition of i18n software news!
Yesterday, a change was deployed in the Bashkir Wikipedia: The categories are now sorted
in the correct alphabetical order.
Bashkir, like many languages of the former Soviet Union, uses the Cyrillic alphabet with
several extra characters. Without proper software support, the extra letters are sorted
according to their Unicode code point order, which is not very useful. For example,
the letter Ө is supposed to be in the middle of the alphabet, between О and П, but without
correct collation it sorts at the end. So Ufa (Өфө), the capital of Bashkortostan, used to
appear at the very end of the "Capitals of Russian regions" category [1]. Now it appears
correctly before П.
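To make the difference concrete, here is a minimal Python sketch of the two sort orders. It is only an illustration and not the MediaWiki code: the ALPHABET string is a simplified assumption (the basic Russian Cyrillic letters with Ө placed between О and П, not the full Bashkir alphabet), the collation_key helper is hypothetical, and the titles are made-up sample entries.

```python
# Simplified illustration: basic Russian Cyrillic letters with Ө placed
# between О and П, as described above. This is NOT the full Bashkir alphabet.
ALPHABET = "АБВГДЕЁЖЗИЙКЛМНОӨПРСТУФХЦЧШЩЪЫЬЭЮЯ"
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def collation_key(word):
    """Hypothetical helper: map each letter to its position in ALPHABET.

    Characters that are not in ALPHABET sort after all known letters,
    ordered among themselves by code point.
    """
    return [
        (0, RANK[ch]) if ch in RANK else (1, ord(ch))
        for ch in word.upper()
    ]


# Made-up sample category entries.
titles = ["Өфө", "Пермь", "Омск", "Якутск"]

# Default string sorting compares Unicode code points, so Ө (U+04E8)
# lands after Я (U+042F).
by_code_point = sorted(titles)

# Sorting with the alphabet-based key puts Ө between О and П.
by_alphabet = sorted(titles, key=collation_key)

print(by_code_point)  # ['Омск', 'Пермь', 'Якутск', 'Өфө']
print(by_alphabet)    # ['Омск', 'Өфө', 'Пермь', 'Якутск']
```

A real collation also has to deal with case, digits, punctuation, and mixed scripts, which is part of why getting the alphabet into CLDR and ICU is the better long-term fix.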
This could be resolved by adding the collation for this language to CLDR and ICU, and I
filed a ticket about this with CLDR [2]. Actually getting it added and deployed is a long
process, but the MediaWiki developer Brian Wolff provided a good interim solution in
MediaWiki code itself. The infrastructure code around it is surprisingly tricky, but to
simply add a new alphabet, you just need to create a file like this:
https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/c…
When it is added to CLDR and ICU, this stopgap solution can be removed from MediaWiki.
As far as I can see, Bashkir is the first language for which such a comprehensive
solution was made inside MediaWiki, and it is needed for many others. I'll start
looking for other languages where this is needed. My process would be something like
this:
1. Find a language in which there is a Wikipedia with incorrect collation.
2. Find the correct alphabetical order, using a grammar book or a dictionary, and confirm
it with editors in that language [3].
3. Submit a ticket to CLDR.
4. Add a file with an alphabet, like the Bashkir file above, to MediaWiki core.
5. Get it reviewed, merged, and deployed.
6. Deploy the change to the projects in that language.
7. Run a script that converts the categories to the new collation.
(Steps 5 and 6 sound repetitive because the collation needs to be explicitly enabled for
each wiki. I filed another bug [4], which suggests defining a default collation per
language, so that step 6 won't be needed.)
If anybody has better suggestions about working with CLDR and ICU and getting them to add
and release these collation files faster, I'll be very happy to hear them.
[1] http://bit.ly/2sWLJaX
[2] http://unicode.org/cldr/trac/ticket/10195
[3] For the confirmation about Bashkir, see https://phabricator.wikimedia.org/T162823
[4] https://phabricator.wikimedia.org/T164985
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
_______________________________________________
Langcom mailing list
Langcom(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/langcom