[i18n software news] Collation for Bashkir


Amir E. Aharoni

13 Jun 2017

2:09 a.m.

Hi,

Another edition of i18n software news!

Yesterday, a change was deployed in the Bashkir Wikipedia: The categories are now sorted in the correct alphabetical order.

Bashkir, like many languages of the Soviet Union, uses the Cyrillic alphabet with several extra characters. Without proper software support, the extra letters are sorted according to their Unicode code point order, which is not very useful. For example, the letter Ө is supposed to be in the middle of the alphabet, between О and П, but without correct collation it is sorted at the end. So Ufa (Өфө), the capital of Bashkortostan, used to appear at the very end of the "Capitals of Russian regions" category [1]; now it appears correctly before П.
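To make the difference concrete, here is a minimal Python sketch of sorting by code point versus sorting by an explicit alphabet. It is not how MediaWiki implements it; the alphabet string is my understanding of the Bashkir order and should be checked against an authoritative source, and the names `BASHKIR_ALPHABET`, `RANK` and `bashkir_key` are made up for this illustration.

```python
# Bashkir Cyrillic alphabet in its assumed dictionary order (illustrative;
# verify against an authoritative source before relying on it).
BASHKIR_ALPHABET = "АБВГҒДҘЕЁЖЗИЙКҠЛМНҢОӨПРСҪТУҮФХҺЦЧШЩЪЫЬЭӘЮЯ"

# Map each uppercase letter to its position in the alphabet.
RANK = {letter: index for index, letter in enumerate(BASHKIR_ALPHABET)}


def bashkir_key(text):
    """Build a sort key: alphabet letters sort by their position,
    any other character sorts after them, by code point."""
    return [
        (0, RANK[ch]) if ch in RANK else (1, ord(ch))
        for ch in text.upper()
    ]


titles = ["Якутск", "Өфө", "Пермь", "Москва", "Орёл"]

# Plain sorted() compares code points, so Ө (U+04E8) lands after Я (U+042F).
by_code_point = sorted(titles)

# With the explicit alphabet, Өфө falls between the О and П entries.
by_alphabet = sorted(titles, key=bashkir_key)

print(by_code_point)
print(by_alphabet)
```

The key function only handles single letters; a real collation also has to deal with things like digits, punctuation and section headings, which is part of why the surrounding infrastructure is harder than the alphabet itself.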

This could be resolved by adding the collation for this language to CLDR and ICU, and I filed a ticket about this with CLDR [2]. Actually getting it added and deployed is a long process, but the MediaWiki developer Brian Wolff provided a good interim solution in MediaWiki code itself. The infrastructure code around it is surprisingly tricky, but to simply add a new alphabet, you just need to create a file like this: https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/co...

When it is added to CLDR and ICU, this stopgap solution can be removed from MediaWiki.

As far as I can see, Bashkir is the first language for which such a comprehensive solution was made inside MediaWiki, and it is needed for many others. I'll start looking for other languages where this is needed. My process would be something like this:

1. Find a language in which there is a Wikipedia with incorrect collation.
2. Find the correct alphabetical order, using a grammar book or a dictionary, and confirm it with editors in that language.
3. Submit a ticket to CLDR.
4. Add a file with an alphabet, like the Bashkir file above, to MediaWiki core.
5. Get it reviewed, merged, and deployed.
6. Deploy the change to the projects in that language.
7. Run a script that converts the categories to the new collation.

(Steps 5 and 6 sound repetitive because the collation needs to be explicitly enabled for each wiki. I filed another bug [4], which suggests defining a default collation per language, so that step 6 won't be needed.)

If anybody has better suggestions about working with CLDR and ICU and getting them to add and release these collation files faster, I'll be very happy to hear them.

[1] http://bit.ly/2sWLJaX
[2] http://unicode.org/cldr/trac/ticket/10195
[3] For the confirmation about Bashkir see https://phabricator.wikimedia.org/T162823 .
[4] https://phabricator.wikimedia.org/T164985

-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com ‪“We're living in pieces, I want to live in peace.” – T. Moore‬


Michael Everson

13 Jun

1:20 p.m.

HUZZAH!


Jon Harald Søby

5:27 p.m.

Ah, this is great, Amir. I remember that for years we had similar problems in the Norwegian Wikipedias, where the last three letters of the alphabet were sorted Å–Æ–Ø instead of the correct Æ–Ø–Å. For us it was a minor issue, since they were all at the end anyway, but I can really see why it is a much bigger issue for Bashkir, where letters in the middle of the alphabet are sorted at the end.

Also, minor unrelated question: Is the "[i18n software news]" tag in the subject of this thread something you added yourself, or is there an awesome mailing list I'm missing out on?


-- mvh Jon Harald Søby

Amir E. Aharoni

8:44 p.m.

2017-06-14 0:27 GMT+03:00 Jon Harald Søby jhsoby@gmail.com:


Also, minor unrelated question: Is the "[i18n software news]" tag in the subject of this thread something you added yourself, or is there an awesome mailing list I'm missing out on?

Myself :)

-- Amir

Amir E. Aharoni

24 Sep

7:39 a.m.

A somewhat belated note about this topic...

The official Wikimedia blog published a post about this: https://blog.wikimedia.org/2017/09/20/bashkir-collation/

It's the first post on that blog to be available in Bashkir translation (yay to our volunteers), but there's a more interesting point.

In the comments to the post there's a question: Can the same be done for Thai? My immediate reaction was surprise: Despite having a relatively complex script, Thai has probably been the best-supported Southeast Asian language in software for a long time; doesn't it have collation support in ICU already?

It looks like Thai is supported there already, but the collation for it is not enabled on our sites. Enabling it is probably easy (see https://phabricator.wikimedia.org/T176434 ), but this raises the question: Would it be a good idea to set a default collation in MediaWiki core rather than doing it on each *site* separately?

Currently, collation rules are not enabled by default, even if ICU supports them: categories show page names in the order of Unicode code points unless another collation is enabled in the site configuration. So this must be done manually for Wikipedia, Wiktionary, Wikivoyage, etc., and it must also be done by each non-Wikimedia MediaWiki user. To me it makes sense that if the site language is Thai, the default Thai collation should be used unless specified otherwise, and the same should apply to all other languages. However, I might be missing something, and there are much better collation experts than me on this mailing list, so I'd love to hear your opinions.
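As a rough sketch of the lookup order I have in mind (explicit site configuration first, then a per-language default, then plain code point order), here is a small Python illustration. Everything in it is hypothetical: the function name, the table and the collation identifiers are invented for the example and are not real MediaWiki settings.

```python
# Hypothetical per-language defaults. The identifiers on the right are
# placeholders for illustration, not real MediaWiki collation names.
DEFAULT_COLLATION_BY_LANGUAGE = {
    "th": "icu-th",     # assumed: a collation that ICU already provides
    "ba": "custom-ba",  # assumed: an interim alphabet shipped in core
}

# What is used when nothing else applies: plain code point order.
FALLBACK_COLLATION = "codepoint"


def resolve_collation(site_language, configured=None):
    """Pick the collation for a site.

    1. An explicit site configuration always wins.
    2. Otherwise use the default for the site language, if one is defined.
    3. Otherwise fall back to code point order.
    """
    if configured is not None:
        return configured
    return DEFAULT_COLLATION_BY_LANGUAGE.get(site_language, FALLBACK_COLLATION)


print(resolve_collation("th"))                          # language default
print(resolve_collation("th", configured="codepoint"))  # explicit override
print(resolve_collation("en"))                          # no default defined
```

With a table like this, a site would only need its own setting when it wants something other than the default for its language.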

Thanks!

-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com ‪“We're living in pieces, I want to live in peace.” – T. Moore‬

