Re: [Wikitech-l] On proper sorting using CLDR (was: varchar(255) binary in tables.sql)

10 Jun 2010

Hoi,
I am really happy with your extensive description of why it is such a pain
in the arse. The situation is even worse: there are more Wikipedia languages
than there are languages with a proper CLDR description. It would be a fine
thing if we could strongly urge our language communities to verify, append
and amend the CLDR. It would make a *practical* difference in
translatewiki.net.

You are right that it is not an absolute road block for other languages to
have their Wikipedia. It is not. It is however amazing that we have a
Wikipedia in languages like Hindi and Malayalam. The problem for those
languages is even more basic. They have problems with Unicode itself.

To appreciate this, compare the Indonesian Wikipedia with all the Wikipedias
of the Indian subcontinent. As Bahasa Indonesia is written in the Latin
script, it is that much easier to write articles for that language. As a
result you will find that the Indonesian Wikipedia is bigger in traffic than
all the Indian Wikipedias combined.

In conclusion, we need to spend genuine effort in supporting other scripts.
I appreciate that you are not volunteering. It would however be a project
that would make a big difference to many of our projects.
Thanks,
       GerardM

On 10 June 2010 14:40, Domas Mituzas <midom.lists(a)gmail.com> wrote:

...
  Hi!

  Yes it is a technical pain in the arse. The question is one of primacy. Is
  it more important to provide service or are technical considerations of the
  most importance. Yes, we discussed this in the past and we did not agree
  then and we do not agree now.
 Well, I agree that it might be a good idea to have language-specific
 ordering, it is just that the costs are quite high and there are not too
 many people eager to do the engineering part of such a project.
 CLDR isn't a panacea: it is a constantly evolving project, with inaccurate
 stable versions (even for well-established languages like mine, heheh), and
 various proposed/testing versions.

 So, to pick a CLDR-based flow and do it properly, it would consist of an
 infinite loop of:

 1. Understanding which languages need a separate collation
 2. Evaluating all available collations for a language, attracting input
 from local communities and standardization bodies
 3. Evaluating the algorithmic implications of the chosen collation - then
 either approaching standards bodies to change it, or simplifying it
 internally (and forking), or implementing algorithms in software (though
 that sometimes is impossible to do in an efficient way)
 4. Porting (3) into a backend of choice
 5. Providing an upgrade path and conflict resolution method for existing
 content
 6. Providing a framework to do full index rebuilds and switchover between
 different collations (ok, this probably is a one-time engineering project,
 albeit quite complex, as it has to have (4) and (5) in mind)
 7. Monitoring for new versions of collations :)
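
To make steps (5) and (6) a bit more concrete, here is a minimal sketch of the conflict check that a collation switchover would need before rebuilding an index. Everything in it is an assumption for illustration: `find_conflicts` and `hypothetical_new_collation` are made-up names, and plain case folding stands in for whatever collation would actually be chosen.

```python
from collections import defaultdict


def hypothetical_new_collation(title):
    # Stand-in for a real collation key function; here it only folds case.
    return title.casefold()


def find_conflicts(titles, collate):
    """Group titles that are distinct today but compare equal under `collate`."""
    groups = defaultdict(list)
    for title in titles:
        groups[collate(title)].append(title)
    # Only groups with more than one member need conflict resolution.
    return {key: members for key, members in groups.items() if len(members) > 1}


# Titles that are all distinct under a binary comparison.
existing = ["Paris", "PARIS", "Berlin"]
conflicts = find_conflicts(existing, hypothetical_new_collation)
print(conflicts)
```

Any non-empty result is content that needs a resolution rule before the new collation can back a unique index.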

 Multiply all that by the number of languages we have, and do note that
 there are multiple sorting variants per language too (e.g. dictionary vs
 phonebook ordering in Germany).
 So yes, it would be fantastic to have that kind of functionality, but you'd
 need quite some engineering capacity to pull it off.
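
As an illustration of the dictionary vs phonebook point, the sketch below sorts the same names with two simplified key functions. It assumes the commonly described behaviour (dictionary ordering treats ä like a, phonebook ordering treats ä like ae) and ignores the finer tie-breaking rules of the real collations. The names `dictionary_key` and `phonebook_key` are hypothetical.

```python
import unicodedata

# Phonebook-style expansion of umlauts (simplified assumption).
_PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def dictionary_key(word):
    # Dictionary style: strip the diacritic so that ü sorts like u.
    folded = word.lower().replace("ß", "ss")
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def phonebook_key(word):
    # Phonebook style: expand the umlaut so that ü sorts like ue.
    return word.lower().translate(_PHONEBOOK)


names = ["Müller", "Mueller", "Muller", "Mufti"]
by_dictionary = sorted(names, key=dictionary_key)
by_phonebook = sorted(names, key=phonebook_key)
print(by_dictionary)
print(by_phonebook)
```

The two lists come out in different orders, so even within one language a single "correct" sort order does not exist.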

 And if we get to implementation specifics - ordering rules are the same as
 equality rules, which causes quite some confusion in some cases (and some
 people will definitely want terms that sort the same but are not equal) :)
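
A small sketch of the "sorted the same but not equal" problem, assuming a hypothetical accent- and case-insensitive `collation_key`. If equality is defined by that key alone, distinct titles collapse into one; appending the raw string as a tie-breaker keeps them adjacent in sort order but distinct. This is one possible approach, not how any particular backend does it.

```python
import unicodedata


def collation_key(text):
    # Hypothetical collation: ignore case and strip combining accents.
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def sort_key(text):
    # Collation key first, raw string second as a deterministic tie-breaker.
    return (collation_key(text), text)


titles = ["résumé", "resume", "Resume"]

# With equality tied to the collation, all three count as the same term.
distinct_by_collation = {collation_key(t) for t in titles}

# With the tie-breaker they stay distinct but still sort next to each other.
distinct_by_sort_key = {sort_key(t) for t in titles}
ordered = sorted(titles, key=sort_key)
print(len(distinct_by_collation), len(distinct_by_sort_key), ordered)
```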

 Of course, we can use community-driven sortkey hacks for some features ;-)
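
A sortkey hack of that kind could look roughly like the sketch below: editors supply an optional manual sort key per page, and the plain title is used when none is given. The data layout and the `effective_sortkey` helper are assumptions made up for this example.

```python
# Hypothetical page records; "sortkey" is an optional, manually supplied override.
pages = [
    {"title": "The Beatles", "sortkey": "Beatles, The"},
    {"title": "Abba"},
    {"title": "Émile Zola", "sortkey": "Zola, Emile"},
]


def effective_sortkey(page):
    # Prefer the manual key, otherwise fall back to the title itself.
    return page.get("sortkey", page["title"])


ordered_titles = [p["title"] for p in sorted(pages, key=effective_sortkey)]
print(ordered_titles)
```

The comparison itself stays a plain binary one; all language knowledge lives in the keys that the community maintains by hand.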

  I wonder how our English language readers would react when the sort order
  for their lists would be wrong.
 I guess it isn't absolutely tragic for others, as otherwise we wouldn't see
 projects in other languages at all. Now that's a benchmark! ;-)

 Domas
 _______________________________________________
 Wikitech-l mailing list
 Wikitech-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l
