On 4/14/06, Brion Vibber brion@pobox.com wrote: [snip]
It could break string matching, but would definitely break sorting. (Sorting by codepoint may suck, but at least it's predictable.)
More generally, deliberately choosing a non-binary collation which applies to a *different character set* from the one you're really using seems pretty silly. You get unpredictable, incorrect sorting and potentially have strings rejected as invalid.
Collation is a hard problem in general, as I understand it, because the collation of some Unicode characters changes depending on the language. For example, the position of ø in Danish vs. most other languages. Doing it wrong but mostly right isn't too hard, though.
Thus supporting multiple languages correctly in a single database becomes a little difficult. I don't think it's reasonable to expect the database to allow you to magically specify a new collation on the fly for each query, since index order depends on collation.
Instead, given sufficient support in the database, you could create a function enumerate_collation(language, string) which returns an integer array (or a mangled string), with one value for the absolute collation position of each character in the string. You could then define an index on that function applied to the title column for each of the collations you will be using, and ORDER BY enumerate_collation('en', title) in your queries.
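
To make the enumerate_collation idea concrete, here is a rough sketch in Python rather than in any particular database's function language. The alphabet tables are toy data I made up for illustration (the "en" ordering in particular is an assumption, not real locale data), and real collation rules are multi-level and more involved than one integer per character. It only shows the shape of the function and how the same titles order differently per language.

```python
# Toy per-language alphabets. These are illustrative assumptions,
# not real locale data.
COLLATION_ALPHABETS = {
    # Hypothetical "English-like" order: ø is placed right after o.
    "en": "abcdefghijklmnoøpqrstuvwxyzæå",
    # Danish alphabet order: æ, ø, å come after z.
    "da": "abcdefghijklmnopqrstuvwxyzæøå",
}


def enumerate_collation(language, string):
    """Return one integer per character: its position in the chosen
    language's ordering. Tuples compare element by element, so the
    result can be used directly as a sort key."""
    alphabet = COLLATION_ALPHABETS[language]
    position = {ch: i + 1 for i, ch in enumerate(alphabet)}
    # Characters missing from the table sort after all known ones,
    # in codepoint order among themselves.
    fallback_base = len(alphabet) + 1
    return tuple(
        position.get(ch, fallback_base + ord(ch)) for ch in string.lower()
    )


titles = ["pa", "øl", "oz"]
by_en = sorted(titles, key=lambda t: enumerate_collation("en", t))
by_da = sorted(titles, key=lambda t: enumerate_collation("da", t))
print(by_en)  # ['oz', 'øl', 'pa']
print(by_da)  # ['oz', 'pa', 'øl']
```

In a database the function's output would be what gets indexed, one index per language, so the ORDER BY can walk the index instead of sorting on every query.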