Hello,
On Jun 8, 2010, at 11:22 PM, Gerard Meijssen wrote:
> The difference is that it actually does sort according to the CLDR. It
> would be really nice if we did that.
It does not; it sorts according to a partial UCA implementation.
We have discussed CLDR in the past. It is a huge collection of distinct collations, and
even though it is possible to use the LDML files from the CLDR project, it is a PITA, due
to both the partial UCA support and the continuous effort needed to rebuild indexes,
resolve conflicts, and deal with all sorts of obscure "linguists are not computer
scientists" problems :)
On Jun 8, 2010, at 5:28 PM, Paul Houle wrote:
> As a person who has labored mightily to make sense of dbpedia, I
> think that one reason why varbinary is preferable to varchar in many
> applications in wikimedia is that varchar() string comparisons are case
> insensitive and varbinary comparisons are case sensitive.
varchar with a case-insensitive collation is case insensitive; varchar with a binary or
case-sensitive collation is case sensitive.
varbinary(), OTOH, is varchar with the 'binary' character set (if you set the default
server charset to binary, as we do on our 5.x boxes, every varchar you create becomes a
varbinary).
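To illustrate that case sensitivity belongs to the collation rather than to the column type, here is a minimal sketch. It uses SQLite (via Python's standard `sqlite3` module) as a stand-in for MySQL, which is an assumption made only so the example runs without a server; `NOCASE` and `BINARY` are SQLite's built-in collations, not MySQL ones, and `equal_under` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def equal_under(collation: str, a: str, b: str) -> bool:
    """Return True if SQLite treats a and b as equal under the named collation."""
    # A collation name cannot be bound as a parameter, so it is interpolated;
    # only pass trusted, hard-coded names here.
    row = conn.execute(f"SELECT ? = ? COLLATE {collation}", (a, b)).fetchone()
    return bool(row[0])

# Same two strings, same comparison operator, different collation.
ci = equal_under("NOCASE", "Apple", "apple")   # case-insensitive collation
cs = equal_under("BINARY", "Apple", "apple")   # byte-wise collation
print(ci, cs)  # True False
```

The same idea carries over to MySQL in spirit: the comparison result changes with the collation attached to the column or expression, not with the word "varchar" itself.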
> There are 10,000 or so articles in the english wikipedia that have
> titles that vary only by case. Load those into a varchar(255) and put a
> primary key on them and mysql just won't let you do it.
Depends on the collation, but yes, you are right. There are more concerns there than just
case sensitivity, though.
Different collations map digraphs and diacritics differently, which causes quite some
confusion.
In my language, for example, ą = a, but š > s :)
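A rough sketch of how a collation like that interacts with a unique key. Everything here is an assumption for illustration: SQLite stands in for MySQL, and `toy_lt` is a made-up collation that only folds case and treats "ą" as "a" while keeping "š" distinct from "s". It is not a real Lithuanian, UCA or CLDR collation.

```python
import sqlite3

# Hypothetical folding rule: "ą" compares equal to "a"; "š" is left alone.
FOLD = str.maketrans({"ą": "a"})

def sort_key(s: str) -> str:
    """Lower-case the string, then apply the toy folding table."""
    return s.lower().translate(FOLD)

def toy_collation(a: str, b: str) -> int:
    """Three-way comparison on the folded keys, as SQLite expects."""
    ka, kb = sort_key(a), sort_key(b)
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(":memory:")
conn.create_collation("toy_lt", toy_collation)
# The primary key on titles_text enforces uniqueness under toy_lt;
# titles_bin compares raw bytes, much like a varbinary key would.
conn.execute("CREATE TABLE titles_text (title TEXT COLLATE toy_lt PRIMARY KEY)")
conn.execute("CREATE TABLE titles_bin (title BLOB PRIMARY KEY)")

def try_insert(table: str, value) -> bool:
    """Return True if the row was inserted, False on a uniqueness conflict."""
    try:
        conn.execute(f"INSERT INTO {table} VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        return False

titles = ["a", "ą", "A", "š", "s"]
text_results = [try_insert("titles_text", t) for t in titles]
binary_results = [try_insert("titles_bin", t.encode("utf-8")) for t in titles]
print(text_results)    # [True, False, False, True, True]
print(binary_results)  # [True, True, True, True, True]
```

Under the toy collation "ą" and "A" are rejected as duplicates of "a", while "š" and "s" coexist; with byte-wise comparison all five titles are distinct.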
> I looked at a sample of those articles and came to the conclusion
> that the semantic relations between them are complicated enough that
> they cannot be autosquashed.
Indeed. If you go for CLDR-like national collations, you expose yourself not just to
case-sensitivity issues but also to all the different digraph and accented-character
mappings, which add even more confusion to your uniqueness constraints.
On Jun 8, 2010, at 3:58 PM, Ryan Chan wrote:
> obviously, varchar(255) binary does not support characters outside of
> the BMP.
It does, if you use the very, very horrible hack of the latin1 character set (but I'd
always say that is a bad idea, and that the binary charset, aka varbinary, should be used
instead).
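For anyone wondering why the latin1 hack "works", here is a small sketch. It assumes the hack means storing raw UTF-8 bytes in a column declared as latin1, so that each byte is kept as one "character". Python's `latin-1` codec is used as a stand-in; MySQL's latin1 differs from it in details, so treat this as an illustration of the idea only.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
clef = "\U0001D11E"
outside_bmp = ord(clef) > 0xFFFF

# Its UTF-8 encoding needs four bytes, one more than a 3-byte-per-character
# UTF-8 character set can represent.
utf8_bytes = clef.encode("utf-8")

# Reinterpreting those bytes as latin-1 yields four single-byte "characters";
# the store no longer knows it is holding one code point.
as_latin1 = utf8_bytes.decode("latin-1")

# Reading the bytes back and decoding them as UTF-8 recovers the original.
round_trip = as_latin1.encode("latin-1").decode("utf-8")

print(outside_bmp, len(utf8_bytes), len(as_latin1), round_trip == clef)
# True 4 4 True
```

The price is that lengths, comparisons and sorting all operate on bytes dressed up as latin1 characters, which is why a binary charset says the same thing more honestly.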
Domas