Proper category collation support finally implemented!

Bartosz Dziewoński

27 Feb 2013

10:50 p.m.

Just yesterday I managed to get https://gerrit.wikimedia.org/r/#/c/49776/ merged. Based heavily on Tim's work on IcuCollation, it allows one to *finally* get articles correctly sorted on category pages for 67 languages based on the Latin, Greek and Cyrillic alphabets.

I also created https://bugzilla.wikimedia.org/show_bug.cgi?id=45443 to track the process of getting this deployed to Wikimedia wikis. The process is already underway for uk.wiki and pl.wiki; if anybody technical wishes to get it on their wiki first, please create a sub-bug and start a community discussion/vote - I can provide a testwiki in your language :)

Eventually, I'd like this to be deployed on all wikis in those 67 languages. I'll start poking people about this (and will drop a mail to -ambassadors) once wmf11 is deployed and the change goes live on a few wikis.

-- Matma Rex

Bináris

27 Feb 2013

11:11 p.m.

Oh no! You mean bug#164 will be solved now in less than nine years? That's a great day, hurray!!!!!!!!! :-))))))) And thanks a lot. When is it scheduled to be deployed?

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

-- Bináris

Bartosz Dziewoński

11:31 p.m.

On Thu, 28 Feb 2013 00:11:19 +0100, Bináris wikiposta@gmail.com wrote:

> When is it scheduled to be deployed?

The code change itself will go live with MW 1.21wmf11 (see https://www.mediawiki.org/wiki/MediaWiki_1.21/Roadmap for deployment dates), and I'll try to get the configuration changes deployed on pl.wiki (and possibly uk.wiki as well) shortly afterwards.

There's no Great Deployment Plan (yet), and I don't have enough free time (nor access to WMF resources) to draft one. As I said, I'll mail the -ambassadors list, set up testwikis and submit config change proposals for anyone who wishes to have one, and that's probably all I can do. I'll try to poke some Wikimedia communities about this, though (especially ones with particularly fanciful ;) alphabets).

-- Matma Rex

Siebrand Mazeland (WMF)

11:15 p.m.

On 27 Feb 2013, at 23:50, Bartosz Dziewoński matma.rex@gmail.com wrote:

> Just yesterday I managed to get https://gerrit.wikimedia.org/r/#/c/49776/ merged. Based heavily on Tim's work on IcuCollation, it allows one to *finally* get articles correctly sorted on category pages for 67 languages based on the Latin, Greek and Cyrillic alphabets.

Nice work, Bartosz. Thank you for all your efforts. Don't stop there, go for gold and beyond the three scripts you can read ;).

Cheers!

-- Siebrand Mazeland

Paul Selitskas

11:20 p.m.

Does this need any maintenance/* runs? I want to test this for Belarusian (be + be-tarask), but so far I get the same results as before the git pull.

-- Regards, Павел Селіцкас/Pavel Selitskas Wizardist @ Wikimedia projects

Bartosz Dziewoński

11:27 p.m.

On Thu, 28 Feb 2013 00:20:09 +0100, Paul Selitskas p.selitskas@gmail.com wrote:

> Does this need any maintenance/* runs? I want to test this for Belarusian (be + be-tarask), but so far I get the same results as before the git pull.

Yes, you need to run maintenance/updateCollation.php and then purge all category pages.

And if you run into any weird display bugs (like letters sorting under headings containing weird symbols), check out https://bugzilla.wikimedia.org/show_bug.cgi?id=43740 .

-- Matma Rex

Paul Selitskas

11:33 p.m.

I had to add 'be-tarask' to $tailoringFirstLetters and set $wgCategoryCollation explicitly to make this thing work. But it damn works! Awesome, thanks!

Can character mapping also be implemented here? For example, in Belarusian the letter «Ґ» should be in the same section as «Г», and «Ў» in the same section as «У». It's not an urgent request, just my curiosity.

-- Regards, Павел Селіцкас/Pavel Selitskas Wizardist @ Wikimedia projects

Brian Wolff

28 Feb 2013

2:40 a.m.

Hmm. The collation chart for "be" [1] doesn't seem to mention Ґ. It does mention ѓ though, which looks kind of similar to my untrained eye. In any case, if the lack of Ґ being specified is incorrect, it is an upstream issue with either the ICU project or the CLDR project (I think).

[1] http://collation-charts.org/icu442/icu442-be.html

-bawolff

Bartosz Dziewoński

3:24 p.m.

On Thu, 28 Feb 2013 00:33:57 +0100, Paul Selitskas p.selitskas@gmail.com wrote:

> I had to add 'be-tarask' to $tailoringFirstLetters and set $wgCategoryCollation explicitly to make this thing work.

Yes, it's not enabled by default. That should probably wait until the support is more battle-tested :)

> Can character mapping also be implemented here? For example, in Belarusian the letter «Ґ» should be in the same section as «Г», and «Ў» in the same section as «У». It's not an urgent request, just my curiosity.

I created a testwiki in Belarusian with uca-be collation to test this: http://users.v-lo.krakow.pl/~matmarex/testwiki-be/index.php?title=%D0%9A%D0%...

It seems like Ґ and Г behave correctly. I don't know why Ў and У are separate; probably most languages they're used in consider them entirely separate letters. This is certainly doable, though; we simply need to make Ў not create a heading in the same way we made ё create one; it should start sorting under У then. I didn't realize this kind of behavior is possible :)

(If they are sorted / separated differently on your install, you probably need to run the maintenance/languages/generateCollationData.php script - see https://bugzilla.wikimedia.org/show_bug.cgi?id=43740 .)
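
The heading behaviour discussed above can be sketched roughly as follows. This is a minimal illustration in Python, not MediaWiki's implementation: the alphabet string, the `no_heading` parameter and all function names are assumptions invented for the example. Letters passed in `no_heading` do not open a section of their own and are listed under the nearest preceding letter.

```python
# Illustrative sketch only; none of these names come from MediaWiki.
# Assumed Belarusian alphabet order, with the optional letter Ґ placed after Г.
ALPHABET = "АБВГҐДЕЁЖЗІЙКЛМНОПРСТУЎФХЦЧШЫЬЭЮЯ"
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def sort_key(title):
    """Rank each character by its alphabet position; unknown characters sort last."""
    return [RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in title.upper()]


def heading_for(title, no_heading):
    """Return the section heading a (non-empty) title is listed under."""
    first = title[0].upper()
    rank = RANK.get(first)
    if rank is None:
        return first  # not in the assumed alphabet: use the character itself
    # Step back to the nearest letter that is allowed to open a section.
    while rank > 0 and ALPHABET[rank] in no_heading:
        rank -= 1
    return ALPHABET[rank]


def group_by_heading(titles, no_heading=frozenset("Ґ")):
    """Sort titles and bucket them under their headings, in sorted order."""
    sections = {}
    for title in sorted(titles, key=sort_key):
        sections.setdefault(heading_for(title, no_heading), []).append(title)
    return sections


# Made-up sample titles, used only to exercise the functions.
TITLES = ["Ўсход", "Гара", "Ґанак", "Урок", "Град"]
```

With the default argument, `group_by_heading(TITLES)` lists Ґанак under Г while Ў keeps its own section; passing `no_heading=frozenset("ҐЎ")` would fold Ў under У as well.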

-- Matma Rex

Paul Selitskas

3:28 p.m.

The result in the link you provided is ideal. Г and Ґ are in one bucket, while У and Ў are separated. That's what we need.

Great job done!

-- Regards, Павел Селіцкас/Pavel Selitskas Wizardist @ Wikimedia projects

Bartosz Dziewoński

3:30 p.m.

On Thu, 28 Feb 2013 16:28:43 +0100, Paul Selitskas p.selitskas@gmail.com wrote:

> The result in the link you provided is ideal. Г and Ґ are in one bucket, while У and Ў are separated. That's what we need.

Ah, that's great, I misunderstood. :)

-- Matma Rex

Gerard Meijssen

2 Mar 2013

12:30 p.m.

Hi, I googled for IcuCollation and found this on their website: "Starting in release 1.8, the ICU Collation Service is updated to be fully compliant to the Unicode Collation Algorithm (UCA) ( http://www.unicode.org/unicode/reports/tr10/ ) and conforms to ISO 14651."

My question: will we use a version of IcuCollation that is later than 1.8? Asking whether IcuCollation supports the latest version of the UCA is probably too much to ask for... It would give us alphabetic characters after the characters of a default script.

Thanks, GerardM

Brian Wolff

1:32 p.m.

IcuCollation is the name of the MediaWiki class. The actual underlying code is from a software project called ICU (or more specifically ICU4C). Which version is used depends on which version MediaWiki is compiled against. Version 1.8 is really, really old (which makes me think you may have found the wrong software project). The latest stable release is 50, but 51 is going to be released soon. We use version 4.2 of the ICU library, which implements CLDR 1.7 and Unicode 5.1; that is a tad older, but not horribly so.

I believe people want to update ICU to a newer version for better Chinese collation. I imagine things would eventually be updated to a version that has script reordering. OTOH, ICU library updates have a high cost, so maybe not unless there is a bigger benefit than script reordering. (Note that script reordering is already an option. The only difference in the newest UCA is that script reordering is the default instead of an option.)

TL;DR: no. It's compatible with a version of UCA, but UCA is not a fixed standard and changes over time.
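
As a rough illustration of what script reordering means in practice, the sketch below sorts titles so that one chosen script comes before all others. It is a toy model in Python, not how ICU works internally: it compares raw code points instead of real collation weights, and it guesses the script from the first word of each character's Unicode name. `script_of` and `reordered_key` are hypothetical helpers, and the sample titles are made up.

```python
import unicodedata


def script_of(ch):
    """Guess a script label from the Unicode name, e.g. 'CYRILLIC' or 'LATIN'."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no name in the Unicode database
        return "UNKNOWN"


def reordered_key(title, first_script):
    """Sort key that puts characters of first_script ahead of every other script."""
    return [(0 if script_of(ch) == first_script else 1, ord(ch)) for ch in title]


# Made-up sample titles in Latin, Cyrillic and Greek script.
TITLES = ["Apple", "Яблоко", "Λάμδα", "Банан"]

plain = sorted(TITLES)  # plain code point order: Latin, Greek, Cyrillic
cyrillic_first = sorted(TITLES, key=lambda t: reordered_key(t, "CYRILLIC"))
```

Here `plain` keeps the Latin title first, while `cyrillic_first` moves both Cyrillic titles to the front and leaves the relative order of the remaining scripts unchanged.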

-bawolff