Re: [Wikitech-l] More aggressive DEFAULTSORT

14 May 2009


On Thu, May 14, 2009 at 7:38 AM, Domas Mituzas <midom.lists@gmail.com> wrote:
...
[13:36:06]      GerardM-        so how do we currently deal with the languages from India where the order of Unicode is almost certainly to be wrong
[13:36:17]      domas   well, currently we're using byte order
[13:36:24]      domas   it is not any kind of unicode order
[13:36:35]      GerardM-        so there is no proper sorting
[13:36:36]      domas   as utf8 is variable length, offsets of character starts are different

Well, a binary sort of UTF-8 is code point order.  The first byte of a
one-byte character starts with the bit 0, of a two-byte character with
110, of a three-byte character with 1110, and of a four-byte character
with 11110, so they'll always sort as 1-byte < 2-byte < 3-byte <
4-byte.  Within each length the high-order bits of the code point come
first, so the variable length makes no difference.  But code point
order isn't very good: even in English, Z < a, let alone languages
with diacritics or whatnot.
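
A quick way to check the claim above is to sort a few strings twice, once by their raw UTF-8 bytes and once by code point, and compare the results. The following is only a sketch: the sample strings and the helper names `utf8_key` and `lead_byte_bits` are made up for illustration and have nothing to do with how MediaWiki stores or sorts sort keys.

```python
# Sketch: does sorting by raw UTF-8 bytes match sorting by code point?
# The sample strings are illustrative only, not taken from any wiki data.
samples = [
    "zebra",
    "Apple",
    "Zulu",
    "apple",
    "\u00e9cole",      # starts with U+00E9 (two bytes in UTF-8)
    "\u0915",          # U+0915, Devanagari KA (three bytes)
    "\u65e5\u672c",    # U+65E5 U+672C (three bytes each)
    "\U0001F600",      # U+1F600 (four bytes)
]


def utf8_key(s: str) -> bytes:
    """Sort key equal to the string's raw UTF-8 bytes (a binary sort)."""
    return s.encode("utf-8")


def lead_byte_bits(ch: str) -> str:
    """First UTF-8 byte of a single character, as eight binary digits."""
    return format(ch.encode("utf-8")[0], "08b")


by_bytes = sorted(samples, key=utf8_key)
by_code_points = sorted(samples)  # Python compares str values by code point

same_order = by_bytes == by_code_points
z_before_a = utf8_key("Zulu") < utf8_key("apple")  # 'Z' is 0x5A, 'a' is 0x61

print("byte order == code point order:", same_order)
print("'Zulu' sorts before 'apple':", z_before_a)
for ch in ["A", "\u00e9", "\u0915", "\U0001F600"]:
    print(f"U+{ord(ch):04X} lead byte {lead_byte_bits(ch)}")
```

Both flags should come out True, and the lead bytes show the 0 / 110 / 1110 / 11110 prefixes. Getting an order that people actually expect would need a real collation (for example the Unicode Collation Algorithm), which this sketch does not attempt.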
An interesting discussion, anyway.
