On Thu, May 14, 2009 at 7:38 AM, Domas Mituzas midom.lists@gmail.com wrote:
[13:36:06] GerardM- so how do we currently deal with the languages from India where the order of Unicode is almost certainly to be wrong
[13:36:17] domas well, currently we're using byte order
[13:36:24] domas it is not any kind of unicode order
[13:36:35] GerardM- so there is no proper sorting
[13:36:36] domas as utf8 is variable length, offsets of character starts are different
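Well, a binary sort of UTF-8 is code point order. The first byte of a one-byte character starts with the bit 0, of a two-byte character with 110, of a three-byte character with 1110, and of a four-byte character with 11110, so they'll always sort as 1-byte < 2-byte < 3-byte < 4-byte, and longer encodings are exactly the higher code points. Within one length, the code point's bits are stored most significant first, so byte order matches code point order there too, and the variable length makes no difference. But code point order isn't very good: even in English, Z < a, let alone languages with diacritics or whatnot.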
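For anyone who wants to check this, here is a minimal Python sketch. The word list and the two helper names are made up for illustration and have nothing to do with how MediaWiki or the database actually sorts. It compares a sort on the raw UTF-8 bytes with a sort on the code points.

```python
# Sample strings mixing 1-, 2-, 3- and 4-byte UTF-8 characters (illustrative only).
words = ["zebra", "Apple", "éclair", "अ", "日本", "😀", "apple", "Zulu", "a", "ab", ""]


def sort_by_utf8_bytes(strings):
    """Sort by the raw UTF-8 encoding, i.e. a plain binary sort."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


def sort_by_code_points(strings):
    """Sort by the sequence of Unicode code points in each string."""
    return sorted(strings, key=lambda s: [ord(c) for c in s])


by_bytes = sort_by_utf8_bytes(words)
by_code_points = sort_by_code_points(words)

# The two orders agree for this sample.
print(by_bytes == by_code_points)  # True

# Neither is what a reader would expect: uppercase sorts before lowercase.
print(sort_by_utf8_bytes(["apple", "Zulu"]))  # ['Zulu', 'apple']
```

So the byte sort is consistent, just not linguistically useful; getting "apple" before "Zulu" needs a real collation, not a different encoding.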
An interesting discussion, anyway.