Hi!
me and Gerard had some chat about this stuff today, I'll post
transcript if anyone is interested:
[11:47:03] GerardM- Domas, could you explain to me why it would make
a full time job to implement proper collations ? Would it not make
sense when it is part of the standard MySQL ?
[11:47:37] domas well, a) someone would have to actually fix CLDR stuff
[11:47:56] GerardM- what is wrong with it ?
[11:47:56] domas see, if proposed/draft is better than 'stable', etc
[11:47:59] domas it has errors
[11:48:10] domas I looked at random collation, and could see errors
[11:48:48] domas example:
http://unicode.org/cldr/data/common/collation/lt.xml?rev=1.26&content-t…
[11:49:36] GerardM- what is wrong with it ? and also is there a
better approach to this ?
[11:49:52] GerardM- it seems to me that having a standard for this is
the way to go
[11:50:03] domas well, status quo is people maintaining this separately
[11:50:07] domas we at mysql maintain it separately
[11:50:10] domas oracle maintain it separately
[11:50:10] domas etc
[11:50:14] domas CLDR is not a standard yet
[11:50:25] domas it is a good source of information, yes
[11:50:37] GerardM- you mean it is not THE standard that everyone
adheres to
[11:50:44] domas it is not standard per se
[11:50:48] domas CLDR is a project, a collection
[11:50:59] domas a repository
[11:51:16] GerardM- it is treated with the same care as other
standards .. eg ISO-639
[11:51:29] domas well, probably the care is similar, yes, just the
amount of data there is huge
[11:51:38] domas and it is a bit sloppy
[11:51:45] domas I'm sure ISO639 has way less data
[11:51:46] domas :)
[11:51:48] domas anyway
[11:51:52] domas b)
[11:51:56] domas actual merging of this
[11:52:14] domas this includes validating if the collation is
_indexable_
[11:52:32] domas see, this defines comparison rules
[11:52:50] domas to simplify
[11:52:59] domas this is possible: a>b, b>c, c>a
[11:53:09] GerardM- given that we are talking about UTF-8 collation,
there should be no problem ordering them
[11:53:19] domas no, there is always problem
[11:53:30] domas say, in some collation 'll' is same as 'l'
[11:53:35] domas (thats real fact ;-)
[11:53:55] GerardM- in proper Dutch collation ij is one character
[11:53:58] domas right
[11:54:05] domas how to index that?
[11:54:48] domas I've opened quite a few bugs on issues like that in
past few years :)
[11:54:52] domas in mysql
[11:54:56] GerardM- the UTF-8 solution would be that there is a
seperate character for the ij
[11:55:15] domas yes, but if people enter 'ij' it has to be treated
in same way as that special character
[11:55:20] GerardM- there are many such issues
[11:55:22] GerardM- yes
[11:55:24] domas and how does prefixindex work for 'i%' ?
[11:55:30] domas should it include ij's or not?
[11:55:32] domas :)
[11:55:38] domas (it won't)
[11:55:41] domas oh well.
[11:55:44] domas or it may
[11:55:46] domas but it will break indexing
[11:56:00] domas c)
[11:56:04] GerardM- ok so the question is how can we have MySQL and
the CLDR people work together
[11:56:25] GerardM- because it is in everyone's interest to get to a
standard solution
[11:57:09] domas well, mysql supports LDMLs or whichever was the format
[11:57:14] domas here we come to c)
[11:57:18] domas and c) is
[11:57:23] domas deployment
[11:57:47] domas there're two major problems here
[11:58:19] domas during conversion to different charset one would
have diverging datasets in different orders
[11:58:45] domas so, replication can cause data drift
[11:58:53] domas another is uniqueness
[11:59:00] domas e.g. the change I showed you in lt.xml
[11:59:09] domas it is changing unique constraints
[11:59:21] domas 'i' and 'y' were separate in 'standard', and are
together in 'draft'
[12:00:06] domas so, there would have to be a way to resolve all the
collisions in page titles
[12:00:15] domas which is no longer a technology problem, as
community has to be involved
[12:00:50] GerardM- sorting and constraints are imho separate things
[12:00:57] domas you think so :))
[12:02:16] GerardM- when i and y are equal, you show them next to
each other
[12:02:17] GerardM- when the constraint has it that there can only be
one i or y it is a separate rule that can be implemented in a
different way
[12:02:24] domas well
[12:02:31] domas you cannot have different constraints for ordering/
uniqueness
[12:02:33] GerardM- I am sure that it can be implemented in a stored
procedure
[12:02:36] domas as in, different rulesets
[12:03:51] domas tertiary weights would help here, probably, but on
the other hand
[12:03:56] domas 'i' and 'y' _are_ same :)
[12:04:09] domas or, well, who is authority on uniqueness then
[12:04:13] domas if we have one on sorting
[12:04:36] GerardM- the issue you raised was how to deal with
existing data where i and y both exist
[12:05:06] domas well, or 'i' and 'I' as this would lead to case
insensitive unique constraints
[12:05:31] GerardM- so allowing for existing garbage means you have
to be able to sort. uniqueness rules would then exist for new content
[12:05:49] domas you can't
[12:05:56] domas uniqueness rules are enforced by indexing
[12:06:11] GerardM- that is one way of doing it
[12:06:25] GerardM- technically the best
[12:06:35] domas I understand the other way
[12:06:39] domas and I don't want to hear about it ;-)
[12:06:44] GerardM- <grin>
[12:07:04] domas it would probably mean converting all the indexing
into non-unique
[12:07:12] domas not having any varchar PKs
[12:07:31] domas building another layer for multiple-weight comparisons
[12:07:51] domas etc
[12:07:57] domas complexity is insane
[12:08:01] domas we don't do insanely complex things :))
[12:08:09] GerardM- So the question is, how do we get to an open
standard for this ?
[12:08:23] GerardM- and how do we get some proper sorting for
MediaWiki ?
[12:08:24] domas well, let's say CLDR _is_ an open standard
[12:08:38] domas even then, there's huge work to migrate all that
data to MW's data store
[12:08:40] domas and maintain it there
[12:08:48] domas as by a), b), c)
[12:08:50] domas let me add d)
[12:09:01] GerardM- CLDR data IS already implemented in some of the
extensions !!
[12:09:03] domas amount of DBA work, with staging
[12:09:12] domas CLDR data is easy to implement
[12:09:20] domas some of CLDR data is implemented as mysql
collations, or will be
[12:09:27] domas maybe all mysql collations will move to CLDR data in
future
[12:09:46] domas bugs like that get opened:
http://bugs.mysql.com/bug.php?id=37898
[12:10:27] GerardM- That is cool ..
[12:10:47] GerardM- so people can ask MySQL to implement particular
collations ?
[12:10:48] domas and... collation changes are _incompatible_changes_
on datasets
[12:10:54] domas of course
[12:11:04] GerardM- do they get implemented ?
[12:11:08] domas takes a while
[12:11:14] domas it goes to next alpha/development release
[12:11:28] GerardM- right but that is normal in standards as well
[12:11:50] GerardM- so how long would it take .. given a positive
approach
[12:12:55] domas make mediawiki support CLDR - easy
[12:12:59] domas as in, the stack
[12:13:12] domas make the stack support CLDR as a process suitable
for Wikimedia - extreme pain
[12:14:25] GerardM- on the phone
[12:14:28] domas okie
[12:14:35] domas (btw, can I publish this transcript on mailing
list? :)
[12:14:45] domas it is quite informative
[12:15:14] GerardM- <grin> we are not done talking and I was to ask
you if I could blog about this :)
[12:15:34] domas I'm just asking in advance!
[12:15:43] domas btw, as a fun note
[12:15:59] domas CLDR does not resolve problems, where dictionaries
are needed for sorting
[12:16:07] domas my favorite example, provided by River: Welsh.
[12:16:13] domas \o/
[12:17:10] domas gonna go grab some coffee too
[12:20:29] domas hold on, switching back from GPRS to DSL
[12:20:39] * Disconnected ().
[12:20:50] * Disconnected ().
[12:21:00] * Disconnected ().
[12:21:46] * domas_ sets mode -e domas_
[12:21:46] * services. sets mode +e domas
[12:21:49] domas back!
[12:45:19] GerardM- Hoi
[12:45:25] GerardM- back from the phone
[12:46:25] GerardM- given that there are over 7000 languages and that
there are over 25000 linguistic entities (including dialects and
orthographies) it is certainly a "growth area
[12:48:28] GerardM- What I am looking for is that we at least support
the WMF languages properly and that is already enough of a challenge
[13:04:59] domas hehe
[13:05:02] domas as I've said
[13:05:03] domas for now
[13:05:06] domas it is huge huge effort
[13:05:35] GerardM- we can do one language at a time <grin> and start
with English </grin>
[13:07:40] GerardM- What you do not say is that there is an
alternative to this ... so it is a job that needs doing
[13:08:11] GerardM- and imho making sure that it fits in with what is
the standard makes it a best effort
[13:08:32] domas english is easiest
[13:08:35] domas kind of
[13:08:44] domas because we sort english properly even now %)
[13:09:18] domas and to have this going forward, we need multiple
people to work on this project
[13:09:34] domas DBA, engine development, content conflict testing,
collation evaluation, etc
[13:09:44] domas + more hardware for regular staging :)
[13:09:59] domas + rethinking of case sensitivity issues
[13:10:15] GerardM- if we would do it TOGETHER with the Unicode
people, we could get funding for it
[13:10:37] domas I personally am too lazy to work on this project :))
[13:11:38] GerardM- there are other things where you can make a
difference as well
[13:12:19] GerardM- I learned from Siebrand at Berlin about something
that would make our servers more efficient and thereby allow for
growth for a couple of months ...
[13:12:28] GerardM- had to do with loading messages
[13:12:42] domas thats for appservers
[13:12:43] domas :)
[13:12:46] domas thats my plan!
[13:12:49] domas my project!
[13:13:12] GerardM- That is certainly important
[13:13:32] GerardM- and <grin> there is certain to be much more that
needs doing
[13:13:37] domas yes
[13:13:49] domas all this CLDR stuff is _very_ complex, when it comes
to actual implementation
[13:14:00] domas do note, we're already way ahead of curve of others
when it comes to unicode
[13:14:11] domas when I had to talk about certain charset/collation
issues in a conference
[13:14:19] domas the only public implementation of unicode
normalization in software was mediawiki
[13:14:20] domas :)
[13:14:30] GerardM- <grin> you know my rants about Lingala ? We sure
have a long way to go !! </grin>
[13:22:03] GerardM- Would MySQL be interested in this project ?
[13:22:55] domas file a bug, and see! :)
[13:23:20] GerardM- this would be to implement CLDR collation on the
Wikipedia projects
[13:23:48] GerardM- for them it would be a research project ?
[13:24:00] domas why research project
[13:24:03] domas we know all the problems already :)
[13:24:19] GerardM- the problems of implementing CLDR ?
[13:24:29] domas as I've said, MySQL supports LDML
[13:24:48]
domas
http://dev.mysql.com/doc/refman/5.0/en/adding-collation-unicode-uca.html
[13:25:07] domas as I've said
[13:25:09] domas CLDR is just a repo
[13:25:13] GerardM- skyoing
[13:25:15] domas LDML is spec, standard
[13:30:04] GerardM- reading about LDML to find the relation to the
collation
[13:32:41] domas lol
[13:32:47] domas Each collation must have a unique ID. To add a new
collation, you must choose an ID value that is not currently used. The
value must be in the range from 1 to 254. The collation ID that you
choose will show up in these contexts:
[13:32:48] domas hahahahaha
[13:32:51] domas thats awesome limitation
[13:33:15] domas As of MySQL 6.0.8, the range of IDs from 1024 to
2047 is reserved for user-defined collations.
[13:33:18] domas thats better
[13:33:23] GerardM- yes
[13:33:24] GerardM- indeed
[13:36:06] GerardM- so how do we currently deal with the languages
from India where the order of Unicode is almost certainly to be wrong
[13:36:17] domas well, currently we're using byte order
[13:36:24] domas it is not any kind of unicode order
[13:36:35] GerardM- so there is no proper sorting
[13:36:36] domas as utf8 is variable length, offsets of character
starts are different
[13:43:57] GerardM- fyi Martin Benjamin is working on the Afrigen
project, a project to get CLDR data for African languages
[13:44:20] domas as I've said
[13:44:22] domas using that data statically
[13:44:28] domas on a mediawiki deployment
[13:44:30] domas is very easy
[13:44:34] domas the process is difficult
[13:45:36] GerardM- is having this deployed in different ways per
database instance going to have an impact ?
[13:46:03] domas it isn't problem with an instance
[13:46:06] domas it is problem with data
[13:46:45] GerardM- right and this get you in either conversions or
in inabilties to deal with them
[13:47:13] domas yup
[13:47:18] domas it is both problematic and labor-intensive work
[13:47:32] domas anyway
[13:47:38] domas switching to generic unicode sorting
[13:47:41] domas would be huge step ahead
[13:47:55] domas but we're not ready for that atm, simpy because
until mysql6.0, there's no unicode support outside BMP
[13:48:45] GerardM- so when would mysql6 be an option ?
[13:49:19] domas *shrug*, it is alpha now
[13:49:50] GerardM- <grin> it would be an option when you say.. I
will support it </grin>
[13:50:10] domas yes, kind of