At some point in the near future I'll be adding in a per-language sort order adjustment, so that various sorted lists should turn out in more or less correct order for a change. :)
I'd appreciate pointers to descriptions of various languages' sorting requirements so I can try to get them right.
I don't know if we can handle Japanese and Chinese sensibly, but alphabetic languages should generally work fairly well if we sort on a munged copy of the string. E.g., if "ó" sorts the same as "o", we just change it to "o"; if "ó" sorts after "o" (as in Polish, IIRC), it becomes "o~", which should always sort after any "o" and before any "p" in a binary ASCII-order string sort.
Simple replacements should generally work, though we can also do more complicated replacements of certain sequences of characters.
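A rough sketch of the munged-copy idea in Python, just to make it concrete. The table names, `munge` and `sort_titles` are made up for illustration, and the lowercasing step is an extra assumption that the message above does not mention.

```python
# Sketch only: names and tables are hypothetical, not existing code.
FOLD_ACCENT = {"ó": "o"}     # a language where "ó" sorts the same as "o"
ACCENT_AFTER = {"ó": "o~"}   # a language where "ó" sorts after every "o"

def munge(text, table):
    """Return a lowercased copy of text with characters replaced per table."""
    return "".join(table.get(ch, ch) for ch in text.lower())

def sort_titles(titles, table):
    # Compare the munged copies, but return the original strings.
    return sorted(titles, key=lambda title: munge(title, table))

words = ["ósmy", "oz", "oa", "pa"]
print(sort_titles(words, ACCENT_AFTER))  # → ['oa', 'oz', 'ósmy', 'pa']
print(sort_titles(words, FOLD_ACCENT))   # → ['oa', 'ósmy', 'oz', 'pa']
```

The "~" trick works here because "~" has a higher ASCII value than any lowercase letter, so "o~..." lands after "oz" and before "p".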
-- brion vibber (brion @ pobox.com)
On Tue, May 20, 2003 at 11:29:31AM -0700, Brion Vibber wrote:
> I don't know if we can handle Japanese and Chinese sensibly, but
FYI, Japanese is complicated to sort.
I believe this is the order:
Kana:
  a i u e o
  ka/ga ki/gi (kya/gya kyu/gyu kyo/gyo) ku/gu ke/ge ko/go
  sa/za shi/ji (sha/ja shu/ju sho/jo) su/zu se/ze so/zo
  ta/da chi/X (cha chu cho) tsu/dzu te/de to/do
  na ni (nya nyu nyo) nu ne no
  ha/ba/pa hi/bi/pi (hya/bya/pya hyu/byu/pyu hyo/byo/pyo) fu/bu/pu he/be/pe ho/bo/po
  ma mi (mya myu myo) mu me mo
  ya yu yo
  ra ri (rya ryu ryo) ru re ro
  wa (wo [particle, though])
  n
As far as Kanji goes, I believe it is sorted in one of these ways (any of them is OK):
* First by a character's primary radical, then by remaining strokes
* First by total strokes, then by primary radical
* Sound
Good luck! ;)
On Tue, May 20, 2003 at 11:29:31AM -0700, Brion Vibber wrote:
> At some point in the near future I'll be adding in a per-language sort order adjustment, so that various sorted lists should turn out in more or less correct order for a change. :)
> I'd appreciate pointers to descriptions of various languages' sorting requirements so I can try to get them right.
> I don't know if we can handle Japanese and Chinese sensibly, but alphabetic languages should generally work fairly well by making a munged copy of the string such that, eg, if "ó" sorts as the same as "o" we just change it to "o"; if "ó" sorts after "o" (as in Polish IIRC), it becomes "o~", which should always sort after any "o" and before any "p" in a binary ASCII-order string sort.
> Simple replacements should generally work, though we can also do more complicated replacements of certain sequences of characters.
1. In some languages certain letter pairs are treated as a single letter. For example, in Czech "ch" is a letter that sorts after "h", so "ca", "cz", "da", "ha", "ch", "ia" would be the correct sort order ;) Polish is 100% sane about that, maybe with the exception of having two diacritics based on z (order: y z ź ż).
2. Some languages sort first by primary, then by secondary characteristics, so it's *not lexicographical order*. For example, to sort Japanese kana you have to:
    if (strip_marks(x) != strip_marks(y)) return strip_marks(x) - strip_marks(y);
    else return x - y;
where strip_marks() drops the " and ° marks (dakuten and handakuten).
So the order is like: kou gou kouin. Then, sorting kanji is even worse.
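The two-level comparison in point 2 could be sketched in Python roughly like this. It is only an illustration under some assumptions: `strip_marks` and `kana_key` are hypothetical names, the marks are removed via Unicode NFD decomposition, and the primary level falls back on code point order, which only approximates the kana order given earlier in the thread.

```python
import unicodedata

DAKUTEN = "\u3099"      # combining voiced sound mark (the " mark)
HANDAKUTEN = "\u309a"   # combining semi-voiced sound mark (the ° mark)

def strip_marks(text):
    """Drop dakuten/handakuten so that, e.g., 'ご' compares as 'こ'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch not in (DAKUTEN, HANDAKUTEN))

def kana_key(text):
    # Primary level: the text without marks. Secondary level: the text itself,
    # which only matters when two primary keys are equal.
    return (strip_marks(text), text)

words = ["こういん", "ごう", "こう"]          # kouin, gou, kou
print(sorted(words))                 # → ['こう', 'こういん', 'ごう']
print(sorted(words, key=kana_key))   # → ['こう', 'ごう', 'こういん']
```

The second result matches the "kou gou kouin" order above; the plain sort does not.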
(Brion Vibber brion@pobox.com):
> At some point in the near future I'll be adding in a per-language sort order adjustment, so that various sorted lists should turn out in more or less correct order for a change. :)
> I'd appreciate pointers to descriptions of various languages' sorting requirements so I can try to get them right.
Collation rules for all languages are defined in the Unicode spec; I believe MySQL contains many of them, but I'm not sure how to tell it how to use them. It's often a lot more complex than doing a few character substitutions, even for some fairly common languages (for example, Spanish requires some 2-to-1 substitutions, German a 1-to-2, and French looks at accents only when the words are otherwise identical).
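To illustrate the 2-to-1 and 1-to-2 cases, here is a small Python sketch. The rule lists are my guess at what is meant (traditional Spanish treating "ch" and "ll" as letters of their own, German expanding "ß" to "ss"), and `collation_key` and `sort_words` are hypothetical names, not part of any library.

```python
# Sketch only: rule lists and function names are made up for illustration.
SPANISH_TRADITIONAL = [("ch", "c~"), ("ll", "l~")]  # digraph sorts as its own letter
GERMAN = [("ß", "ss")]                              # one letter expands to two

def collation_key(text, rules):
    """Apply each (source, target) substitution to a lowercased copy of text."""
    key = text.lower()
    for source, target in rules:
        key = key.replace(source, target)
    return key

def sort_words(words, rules):
    return sorted(words, key=lambda word: collation_key(word, rules))

print(sort_words(["chico", "cuna", "dama", "cama"], SPANISH_TRADITIONAL))
# → ['cama', 'cuna', 'chico', 'dama']
print(sort_words(["Mast", "Masse", "Maß"], GERMAN))
# → ['Maß', 'Masse', 'Mast']
```

Applying the rules one after another is fine for these tables, but a longer rule list would have to make sure one rule's output cannot be matched by a later rule.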
On Tue, 20 May 2003, Lee Daniel Crocker wrote:
> Collation rules for all languages are defined in the Unicode spec;
Well, that could be handy. :) I'll see if I can dig them up.
hmm... This looks like a place to start: http://www.unicode.org/unicode/reports/tr10/
> I believe MySQL contains many of them, but I'm not sure how to tell it how to use them.
MySQL's really ugly in this regard. First, no UTF-8 support at all.* The collation order modules that it does have (for some 8-bit charsets and some multibyte) can only be enabled on a server-wide basis, so we can't say "this database sorts as English, this one sorts as German, this one sorts as Polish" unless we run separate instances of MySQL.
* Allegedly 4.1 has/will have some Unicode support. It's not stable though.
** Yes, I know PostgreSQL has Unicode support. :) I don't know if it supports per-table or per-column selection of collation order, and there would be much other work to get Wikipedia running on it.
-- brion vibber (brion @ pobox.com)
On Tue, May 20, 2003 at 01:46:48PM -0700, Brion Vibber wrote:
> On Tue, 20 May 2003, Lee Daniel Crocker wrote:
> > Collation rules for all languages are defined in the Unicode spec;
> Well, that could be handy. :) I'll see if I can dig them up.
> hmm... This looks like a place to start: http://www.unicode.org/unicode/reports/tr10/
> > I believe MySQL contains many of them, but I'm not sure how to tell it how to use them.
> MySQL's really ugly in this regard. First, no UTF-8 support at all.* The collation order modules that it does have (for some 8-bit charsets and some multibyte) can only be enabled on a server-wide basis, so we can't say "this database sorts as english, this one sorts as german, this one sorts as polish" unless we run separate instances of MySQL.
> * Allegedly 4.1 has/will have some unicode support. It's not stable though.
> ** Yes, I know PostgreSQL has Unicode support. :) I don't know if it supports per-table or per-column selection of collation order, and there would be much other work to get Wikipedia running on it.
Well, PostgreSQL allows you to set the encoding on a per-database basis. So, you can have some databases with UTF-8, some with EUC_JP, etc. I don't think you can have some ASCII rows and some Unicode rows, although I could certainly be wrong. Its collation rules are based on whatever character set the database uses.
Nick Reinking wrote:
> Well, PostgreSQL allows you to set the encoding on a per database basis. So, you can have some databases with UTF-8, some with EUC_JP, etc. I don't think you can have some ASCII rows and some unicode rows, although I could certainly be wrong. Its collation rules are based on whatever character set the database is.
Database is correct. See: http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=mult...
Brion Vibber wrote:
> At some point in the near future I'll be adding in a per-language sort order adjustment, so that various sorted lists should turn out in more or less correct order for a change. :)
> I'd appreciate pointers to descriptions of various languages' sorting requirements so I can try to get them right.
If you set the environment variable LC_COLLATE to sv_SE.ISO8859-1 or sv_SE.UTF-8, Linux sort(1), strcoll(3) (for instance as the comparison function passed to qsort(3)) and MySQL will do the right thing for Swedish. I think this is true for PHP as well.
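For comparison, a hedged Python sketch of the same locale-based approach. `sort_with_locale` is a made-up helper; it relies on the standard `locale` module and only gives language-specific results if the named locale is actually installed on the machine, otherwise it quietly keeps whatever collation was already active.

```python
import locale

def sort_with_locale(words, locale_name):
    """Sort words with the collation rules of locale_name, if it is installed."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            # Locale not available here: keep the current collation rules.
            pass
        # strxfrm() turns each string into a key that compares per LC_COLLATE.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting

print(sort_with_locale(["zebra", "ärlig", "åka", "öl", "apa"], "sv_SE.UTF-8"))
```

With a working Swedish locale this should put the å, ä and ö words after "zebra", in that order; the exact output depends on the system, so none is shown here.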
> At some point in the near future I'll be adding in a per-language sort order adjustment, so that various sorted lists should turn out in more or less correct order for a change. :)
> I'd appreciate pointers to descriptions of various languages' sorting requirements so I can try to get them right.
I have recently made a little PHP script that sorts correctly in Danish. You can find the result here: http://www.wikipedia.dk/wiki/sortering.php The code snippet below does the actual sorting; maybe it can give you some inspiration on how to do it. The key functions here are strtr(), which replaces all the weird characters with the correct characters for sorting in Danish, and usort(), which does the actual sorting.
<?php
// Build a sort key: fold case and accents, and remap æ, ø, å to å, æ, ø,
// whose ISO-8859-1 byte values rise in that order, so a plain byte
// comparison gives the Danish order (... z æ ø å).
// Assumes single-byte ISO-8859-1 text: three-argument strtr() works byte by byte.
function danish_key($s) {
    return strtr($s,
        "SOZsozY¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿABCDEFGHIJKLMNOPQRSTUVWXYZ",
        "sozsozyyuaaaaaøåceeeeiiiidnoooooæuuuuysaaaaaøåceeeeiiiionoooooæuuuuyyabcdefghijklmnopqrstuvwxyz");
}

// Comparison callback for usort(). strcmp() also handles the case where one
// key is a prefix of the other (the shorter string sorts first).
function cmp($a, $b) {
    $result = strcmp(danish_key($a), danish_key($b));
    // "Stigende" means ascending; any other value sorts descending.
    return ($_POST["Orden"] == "Stigende") ? $result : -$result;
}

$tekst = $_POST["foo"];
$myarray = explode("\n", $tekst);
usort($myarray, "cmp");
$tekst = implode("\n", $myarray);
print $tekst;
?>
Regards Christian
BTW: I have bought wikipedia.dk and made it redirect to da.wikipedia.org, and it will stay that way until the folks at da.wikipedia.org decide otherwise. I will also place little scripts like this Danish sorting script on http://www.wikipedia.dk/wiki/sortering.php
wikitech-l@lists.wikimedia.org