Re: [Wikitech-l] Binary vs. non-binary strings (was: What is required to "fix search"?)

15 Apr 2006


On 4/14/06, Brion Vibber <brion@pobox.com> wrote:
[snip]
...
It could break string matching, but would definitely break sorting. (Sorting by
codepoint may suck, but at least it's predictable.)
More generally, deliberately choosing a non-binary collation which applies to a
*different character set* from the one you're really using seems pretty silly.
You get unpredictable, incorrect sorting and potentially have strings rejected
as invalid.
Collation is a hard problem in general, as I understand it, because
the collation of some Unicode characters changes depending on the
language. For example, the position of ø in Danish differs from its
position in most other languages. Doing it wrong but mostly right
isn't too hard, though.
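
To make the ø example concrete, here is a rough Python sketch. The two
alphabets are toy, lowercase-only tables I made up for illustration (the
"generic" one simply assumes ø is filed next to o), and sort_with is a
hypothetical helper, not anything MediaWiki or MySQL provides.

```python
# Toy alphabets: hypothetical and deliberately incomplete (lowercase only).
DANISH = "abcdefghijklmnopqrstuvwxyzæøå"    # Danish files æ, ø, å after z
GENERIC = "abcdefghijklmnoøpqrstuvwxyzæå"   # assumption: ø treated as a variant of o


def sort_with(alphabet, words):
    """Sort words by each character's position in the given alphabet."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    return sorted(words, key=lambda w: [rank[ch] for ch in w])


words = ["ost", "øl", "zebra", "pil", "år"]

by_codepoint = sorted(words)             # plain binary / code point order
by_danish = sort_with(DANISH, words)
by_generic = sort_with(GENERIC, words)

print(by_codepoint)  # ['ost', 'pil', 'zebra', 'år', 'øl']
print(by_danish)     # ['ost', 'pil', 'zebra', 'øl', 'år']
print(by_generic)    # ['ost', 'øl', 'pil', 'zebra', 'år']
```

Code point order is nearly the Danish order here and only swaps å and ø,
which is the "wrong but mostly right" case. The generic order moves øl
to a different place altogether.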
Thus supporting multiple languages correctly in a single database
becomes a little difficult. I don't think it's reasonable to expect
the database to allow you to magically specify a new collation on the
fly for each query, since index order depends on collation.
Instead, given sufficient support in the database, you could create a
function enumerate_collation(language, string) which returns an
integer array (or a mangled string), with one value for the absolute
collation position of each character in the string. You could then
define an index on that function applied to the title column for each
of the collations you will be using, and ORDER BY
enumerate_collation('en', title) in your queries.
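
As a rough sketch of the idea, and not a claim about what any particular
database supports, here is the "mangled string" variant using Python's
sqlite3 module, which allows indexes on deterministic user-defined
functions. The alphabet tables, the page table, the index names and the
titles_in_order helper are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical, incomplete per-language alphabets (lowercase only).
ALPHABETS = {
    "da": "abcdefghijklmnopqrstuvwxyzæøå",
    "en": "abcdefghijklmnoøpqrstuvwxyzæå",  # assumption: ø sorted next to o
}


def enumerate_collation(language, string):
    """Return a mangled string whose binary order follows the language's order."""
    alphabet = ALPHABETS[language]
    # Each character becomes a fixed-width two-digit hex field, so a plain
    # byte-wise comparison of the results is enough. Unknown characters
    # sort after every known one.
    return "".join(
        format(alphabet.index(ch) + 1 if ch in alphabet else 255, "02x")
        for ch in string.lower()
    )


conn = sqlite3.connect(":memory:")
# Some SQLite builds refuse application-defined functions in schema objects
# unless this is on; older versions ignore the pragma.
conn.execute("PRAGMA trusted_schema = ON")
# deterministic=True is required before SQLite will accept the function
# inside an index expression.
conn.create_function("enumerate_collation", 2, enumerate_collation, deterministic=True)

conn.execute("CREATE TABLE page (title TEXT)")
conn.executemany(
    "INSERT INTO page VALUES (?)",
    [(t,) for t in ["ost", "øl", "zebra", "pil"]],
)

# One index per collation we intend to use.
for lang in ALPHABETS:
    conn.execute(
        f"CREATE INDEX page_title_{lang} ON page (enumerate_collation('{lang}', title))"
    )


def titles_in_order(lang):
    """Return titles ordered by the given language's toy collation."""
    if lang not in ALPHABETS:
        raise ValueError(f"no collation table for {lang!r}")
    # The language is written as a literal so the ORDER BY expression matches
    # the indexed expression exactly.
    query = f"SELECT title FROM page ORDER BY enumerate_collation('{lang}', title)"
    return [row[0] for row in conn.execute(query)]


print(titles_in_order("da"))  # ['ost', 'pil', 'zebra', 'øl']
print(titles_in_order("en"))  # ['ost', 'øl', 'pil', 'zebra']
```

The cost is one extra index per language, and the per-language tables
have to come from somewhere. A real implementation would presumably take
them from proper collation data rather than hand-written strings.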
