Hi,
Collation based on Unicode ... what do you mean by that? Do you mean the
order of the characters in UTF-8, or do you mean the Unicode CLDR order?
The latter is the only sensible approach. The idea of DEFAULTSORT is imho
awful; you want people to invest heavily in something that could work
properly if we decided to use the appropriate standards.
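
To illustrate the difference between the two readings, here is a minimal
Python sketch. Sorting by UTF-8 bytes gives the same result as sorting by
code point, so capitals come before lowercase and accented letters come last.
The function rough_fold_key is a hypothetical stand-in that only strips
accents and case; it is not CLDR or the Unicode Collation Algorithm, which
need a dedicated library.

```python
import unicodedata

words = ["Zebra", "apple", "Ägypten", "zoo", "Apple"]

# Plain string comparison in Python is code point order.
by_code_point = sorted(words)

# UTF-8 preserves code point order, so sorting the encoded bytes agrees.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))


def rough_fold_key(word):
    """Hypothetical approximation: drop combining marks, ignore case.

    Only a stand-in for a real collation; ties fall back to the original word.
    """
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), word)


by_rough_fold = sorted(words, key=rough_fold_key)

print(by_code_point)  # ['Apple', 'Zebra', 'apple', 'zoo', 'Ägypten']
print(by_rough_fold)  # ['Ägypten', 'Apple', 'apple', 'Zebra', 'zoo']
```
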
The main reason I think this DEFAULTSORT is a bad idea is that it is too
much of a burden for the "other" languages and projects. Most projects still
need to concentrate on basic content, and this is just another "must have"
distraction. Again, it makes sense to implement the appropriate standards
instead of applying an ugly hack.
Thanks,
GerardM
2009/5/12 Aryeh Gregor <Simetrical+wikilist@gmail.com>
> On Mon, May 11, 2009 at 3:29 PM, Lars Aronsson <lars(a)aronsson.se> wrote:
> > There is a way to avoid all such problems, namely by a more
> > aggressive use of DEFAULTSORT that removes from sorting all upper
> > case letters (except the initial one), all whitespace and all
> > commas. It would mean almost every article needs a DEFAULTSORT.
> > In the examples above:
> >
> > {{DEFAULTSORT:Walesjimmy}}
> > {{DEFAULTSORT:Europeancourtofauditors}}
> > {{DEFAULTSORT:Europeanunionmission}}
> > {{DEFAULTSORT:Europeanquarterofbrussels}}
> > {{DEFAULTSORT:Moonillusion}}
>
> This would be a good thing to do in the software. We could implement
> the framework reasonably easily, if anyone cares to, and then let each
> language do its thing. A basic English implementation like this would
> be easy enough.
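
A minimal Python sketch of the normalization Lars describes (drop whitespace
and commas, lowercase everything except the first character). The function
name and the input titles are assumptions made for illustration; only the
resulting keys appear in the thread, and MediaWiki itself is written in PHP.

```python
def normalized_sortkey(text):
    """Hypothetical helper: remove whitespace and commas, then lowercase
    every character except the first one."""
    compact = "".join(ch for ch in text if not ch.isspace() and ch != ",")
    return compact[:1] + compact[1:].lower()


# Assumed inputs that would produce the DEFAULTSORT values quoted above.
examples = [
    "Wales, Jimmy",
    "European Court of Auditors",
    "European Union mission",
    "European Quarter of Brussels",
    "Moon illusion",
]

keys = [normalized_sortkey(title) for title in examples]
for key in keys:
    print("{{DEFAULTSORT:%s}}" % key)
```
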
> Of course, any change to the sortkey beyond the first will require
> that all existing sort keys be changed by a batch job -- otherwise
> sorting will be a mess. Every change to the sortkey algorithm would
> either require that all pages be reparsed (very expensive), or that a
> special conversion script be defined to account for that exact change.
> Unless it's minor enough that the inconsistency is acceptable, I
> guess.
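
The following Python sketch shows why mixed sort keys are a problem and what a
batch conversion might look like. Everything here is hypothetical: new_key is
an assumed "new" algorithm, the dictionaries stand in for the stored sort
keys, and a real job would run against the database rather than in memory.

```python
def new_key(title):
    # Assumed new algorithm: drop whitespace, lowercase all but the first char.
    compact = "".join(title.split())
    return compact[:1] + compact[1:].lower()


def order(sortkeys):
    # Titles ordered by their stored sort key (plain code point comparison).
    return [title for title, _ in sorted(sortkeys.items(), key=lambda kv: kv[1])]


def rekey_in_batches(sortkeys, key_func, batch_size=2):
    # Recompute every stored key, a few titles at a time.
    titles = list(sortkeys)
    for start in range(0, len(titles), batch_size):
        for title in titles[start:start + batch_size]:
            sortkeys[title] = key_func(title)


titles = ["Moon illusion", "Moon rock", "Moonbeam"]
old = {t: t for t in titles}            # old algorithm: key is the title
new = {t: new_key(t) for t in titles}   # every page already converted

# Only one page has been re-saved under the new algorithm.
mixed = dict(old)
mixed["Moon rock"] = new_key("Moon rock")
mixed_order = order(mixed)  # matches neither the old nor the new order

rekey_in_batches(mixed, new_key)
fixed_order = order(mixed)  # consistent again after the batch job
```
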
> On Tue, May 12, 2009 at 7:18 AM, Petr Kadlec <petr.kadlec(a)gmail.com> wrote:
> > Well, not really. Bug 164 would be fixed almost completely for
> > Czech-language wikis by using database features designed for exactly
> > this problem. [1] But, I guess you know the situation.
> > ...
> > [1] http://dev.mysql.com/doc/refman/4.1/en/charset-collation-effect.html
>
> Note the version. Wikimedia uses MySQL 4.0, which doesn't contain any
> charsets or collations other than binary. If we used a higher
> version, utf8 might be an option: that would use a Unicode collation,
> I guess, which should at least be okay for most languages, if not
> perfect. (But MySQL's utf8 has other downsides, like being
> variable-width and not supporting Unicode outside the BMP.)
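
The two downsides mentioned for MySQL's utf8 can be seen with a short Python
sketch. The helper names are made up for illustration; the check assumes that
MySQL's utf8 charset stores at most three bytes per character, which is what
limits it to the Basic Multilingual Plane (code points up to U+FFFF).

```python
def utf8_width(ch):
    # Number of bytes one character occupies when encoded as UTF-8.
    return len(ch.encode("utf-8"))


def fits_three_byte_utf8(text):
    # Hypothetical check: True if every character is inside the BMP,
    # i.e. needs no more than three bytes in UTF-8.
    return all(ord(ch) <= 0xFFFF for ch in text)


# 'a', 'č' (U+010D), '€' (U+20AC), musical G clef (U+1D11E, outside the BMP)
samples = ["a", "\u010d", "\u20ac", "\U0001D11E"]
widths = [utf8_width(ch) for ch in samples]

print(widths)                                  # [1, 2, 3, 4]
print(fits_three_byte_utf8("\u010desky"))      # True
print(fits_three_byte_utf8("\U0001D11E"))      # False
```
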
> > If Swedish sorting rules are simple enough that removing all
> > whitespace and punctuation and converting to lower case would solve
> > most of the problems, I would say that such feature would not be too
> > difficult to implement right into MediaWiki (into LanguageSv.php),
> > writing those DEFAULTSORT codes explicitly into every article would be
> > nonsense, IMHO. (So, go ahead with it, I won’t stop you or anything,
> > I’m just trying to say that this is not really a solution for Czech
> > language.)
>
> There's no reason this couldn't be implemented for Czech as well in
> the software, in principle. Ideally we'd use something based on
> Unicode collation as a baseline, with optional customizations per
> language:
>
> http://unicode.org/reports/tr10/
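
As a rough idea of what a per-language customization could look like, here is
a toy Python sketch for Czech. It is not the Unicode Collation Algorithm and
not MediaWiki code: the letter table is a simplified assumption in which
"ch" is a single letter sorting after "h", the letters č, ř, š and ž sort
after their base letters, and other diacritics are ignored.

```python
import unicodedata

# Simplified, assumed primary order for a toy Czech tailoring.
CZECH_ORDER = [
    "a", "b", "c", "č", "d", "e", "f", "g", "h", "ch", "i", "j", "k", "l",
    "m", "n", "o", "p", "q", "r", "ř", "s", "š", "t", "u", "v", "w", "x",
    "y", "z", "ž",
]
RANK = {letter: i for i, letter in enumerate(CZECH_ORDER)}
UNKNOWN = len(CZECH_ORDER)  # anything not in the table sorts last


def strip_marks(ch):
    # Remove combining marks, e.g. 'á' -> 'a'.
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def czech_sort_key(text):
    s = unicodedata.normalize("NFC", text).casefold()
    key = []
    i = 0
    while i < len(s):
        if s.startswith("ch", i):
            token = "ch"  # digraph treated as one letter
            i += 2
        else:
            token = s[i]
            i += 1
            if token not in RANK:
                token = strip_marks(token)
        key.append(RANK.get(token, UNKNOWN))
    return tuple(key)


words = ["chata", "cibule", "hrad", "čaj", "cena", "idea"]
by_code_point = sorted(words)
by_czech_key = sorted(words, key=czech_sort_key)
```

With code point order, "chata" lands between "cena" and "cibule" and "čaj"
goes to the very end; with the toy key, "čaj" follows the c-words and "chata"
follows "hrad".
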
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l