Re: [Langcom] [i18n software news] Collation for Bashkir

13 Jun 2017

Ah, this is great, Amir. I remember that for years we had a similar problem in
the Norwegian Wikipedias, where the last three letters of the alphabet were
sorted Å–Æ–Ø instead of the correct Æ–Ø–Å. For us it was a minor issue,
since they were all at the end anyway, but I can really see why it is a
much bigger issue for Bashkir, where letters in the middle of the alphabet
are sorted at the end.
Also, minor unrelated question: Is the "[i18n software news]" tag in the
subject of this thread something you added yourself, or is there an awesome
mailing list I'm missing out on?
2017-06-13 19:20 GMT+02:00 Michael Everson <everson@evertype.com>:
...
HUZZAH!
...
On 13 Jun 2017, at 07:09, Amir E. Aharoni <amir.aharoni@mail.huji.ac.il>
wrote:
...
Hi,
Another edition of i18n software news!
Yesterday, a change was deployed in the Bashkir Wikipedia: The
categories are now sorted in the correct alphabetical order.
...
Bashkir, like many languages of the Soviet Union, uses the Cyrillic
alphabet with several extra characters. Without proper software support,
the extra letters are sorted according to their Unicode code point
order, which is not very useful. For example, the letter Ө is
supposed to be in the middle of the alphabet, between О and П, but without
correct collation it sorts at the end. As a result, Ufa (Өфө), the capital
of Bashkortostan, used to appear at the very end of the "Capitals of
Russian regions" category [1]; now it appears correctly before П.
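To make the problem concrete, here is a minimal Python sketch of what sorting by code point does to a title starting with Ө. Apart from Өфө, the titles are illustrative placeholders and not necessarily the real page names in that category.

```python
# Illustrative titles; only "Өфө" (Ufa) is taken from the message above.
titles = ["Пермь", "Өфө", "Ярославль", "Омск"]

# Code point of each title's first letter, e.g. "U+04E8" for Ө.
code_points = {title[0]: f"U+{ord(title[0]):04X}" for title in titles}

# Python's default string sort compares code points, which is the
# behaviour described above for software without collation support.
by_code_point = sorted(titles)

print(code_points)
print(by_code_point)
```

Ө is U+04E8, which lies beyond Я (U+042F), so the title starting with it lands after everything that starts with a basic Cyrillic letter.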
...
This could be resolved by adding the collation for this language to CLDR
and ICU, and I filed a ticket about this with CLDR [2]. Actually getting it
added and deployed is a long process, but the MediaWiki developer Brian
Wolff provided a good interim solution in MediaWiki code itself. The
infrastructure code around it is surprisingly tricky, but to simply add a
new alphabet, you just need to create a file like this:
...
https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/collation/BashkirUppercaseCollation.php
...
When it is added to CLDR and ICU, this stopgap solution can be removed
from MediaWiki.
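The general idea of an alphabet-driven collation can be sketched in Python as follows. This is not the MediaWiki code: the class name AlphabetCollation and its sort_key method are hypothetical, and the full alphabet string is my assumption of the Bashkir order (only the position of Ө between О and П is stated in this thread), so check it against the actual file before reusing it.

```python
# Assumed Bashkir alphabet order; not given in full in this thread, so
# verify it against the MediaWiki file or a dictionary before relying on it.
BASHKIR_ALPHABET = "АБВГҒДҘЕЁЖЗИЙКҠЛМНҢОӨПРСҪТУҮФХҺЦЧШЩЪЫЬЭӘЮЯ"


class AlphabetCollation:
    """Hypothetical alphabet-driven collation; not MediaWiki's actual API."""

    def __init__(self, alphabet):
        # Position of each letter in the supplied alphabet.
        self.rank = {letter: index for index, letter in enumerate(alphabet)}

    def sort_key(self, text):
        # Letters of the alphabet sort by their listed position; any other
        # character sorts after them, by code point.
        return tuple(
            (0, self.rank[ch]) if ch in self.rank else (1, ord(ch))
            for ch in text.upper()
        )


collation = AlphabetCollation(BASHKIR_ALPHABET)
titles = ["Пермь", "Өфө", "Омск"]  # illustrative titles
sorted_titles = sorted(titles, key=collation.sort_key)
print(sorted_titles)
```

With a key like this, supporting another language comes down to supplying a different alphabet string, which matches the "just create a file" description above.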
...
As far as I can see, Bashkir is the first language for which such a
comprehensive solution was made inside MediaWiki, and it is needed for many
others. I'll start looking for other languages where this is needed. My
process would be something like this:
...

1. Find a language in which there is a Wikipedia with incorrect
   collation.
2. Find the correct alphabetical order, using a grammar book or a
   dictionary, and confirm it with editors in that language.
3. Submit a ticket to CLDR.
4. Add a file with an alphabet, like the Bashkir file above, to
   MediaWiki core.
5. Get it reviewed, merged, and deployed.
6. Deploy the change to the projects in that language.
7. Run a script that converts the categories to the new collation.

(Steps 5 and 6 sound repetitive because the collation needs to be
explicitly enabled for each wiki. I filed another bug [4], which suggests
defining a default collation per language, so that step 6 won't be needed.)
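For the first two steps of the process above, a quick check can show whether a language's alphabet is affected at all. The Python sketch below is only an illustration: the helper name code_point_descents is hypothetical, and the alphabet excerpt covers just the letters around Ө mentioned in this thread.

```python
def code_point_descents(alphabet):
    """Return adjacent letter pairs where code point order disagrees
    with the order given in the alphabet."""
    letters = list(alphabet)
    return [(a, b) for a, b in zip(letters, letters[1:]) if ord(a) > ord(b)]


# Excerpt of the Bashkir alphabet around Ө, in the order described above.
excerpt = "НОӨПР"
problems = code_point_descents(excerpt)

# An empty list means plain code point sorting already matches the alphabet.
needs_custom_collation = bool(problems)
print(problems, needs_custom_collation)
```

Any pair in the result marks a place where sorting by code point would put the letters in the wrong order, which suggests the language needs its own collation.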
...
If anybody has better suggestions about working with CLDR and ICU and
getting them to add and release these collation files faster, I'll be very
happy to hear them.
...
[1] http://bit.ly/2sWLJaX
[2] http://unicode.org/cldr/trac/ticket/10195
[3] For the confirmation about Bashkir see
https://phabricator.wikimedia.org/T162823 .
...
[4] https://phabricator.wikimedia.org/T164985
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
‪“We're living in pieces,
I want to live in peace.” – T. Moore‬
_______________________________________________
Langcom mailing list
Langcom@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/langcom

-- 
mvh
Jon Harald Søby
