On Thu, May 14, 2009 at 7:38 AM, Domas Mituzas <midom.lists(a)gmail.com> wrote:
[13:36:06] GerardM- so how do we currently deal with the languages from India where the order of Unicode is almost certainly to be wrong
[13:36:17] domas well, currently we're using byte order
[13:36:24] domas it is not any kind of unicode order
[13:36:35] GerardM- so there is no proper sorting
[13:36:36] domas as utf8 is variable length, offsets of character starts are different
Well, a binary sort of UTF-8 is code-point order. One-byte characters
start with the bit 0, two-byte characters with 110, three-byte
characters with 1110, and four-byte characters with 11110, so lead
bytes always sort as 1-byte < 2-byte < 3-byte < 4-byte, which matches
the code point ranges those lengths cover. Within one length, the
payload bits are stored most significant first, so the variable length
makes no difference. But code point order isn't very good: even in
English, Z < a, let alone languages with diacritics or whatnot.
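For anyone who wants to check this, here is a rough Python sketch. The
word list and the helper name lead_byte_bits are made up for
illustration; nothing here is taken from MediaWiki or the database. It
sorts a few strings by their raw UTF-8 bytes, compares that with
Python's own string ordering (which goes by code point), and prints the
lead-byte bit patterns mentioned above.

```python
# Illustrative sample only; these strings are assumptions, not wiki data.
words = ["zebra", "Apple", "éclair", "apple", "Zulu", "日本", "𝄞"]

# Sort by raw UTF-8 bytes, which is what a binary sort does.
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# Python compares str values code point by code point.
by_code_point = sorted(words)


def lead_byte_bits(ch):
    """Return the first UTF-8 byte of ch as an 8-digit binary string."""
    return format(ch.encode("utf-8")[0], "08b")


print(by_bytes == by_code_point)  # the two orderings agree
print(by_bytes)
for ch in "Aé日𝄞":
    # 1-, 2-, 3- and 4-byte characters respectively
    print(ch, lead_byte_bits(ch))
```

With this sample, "Zulu" lands before "apple" and "éclair" lands after
"zebra", which is the part a proper collation would have to fix.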
An interesting discussion, anyway.