Hi!
Gerard and I had a chat about this stuff today; I'll post the transcript in case anyone is interested:
[11:47:03] GerardM- Domas, could you explain to me why it would make a full time job to implement proper collations ? Would it not make sense when it is part of the standard MySQL ?
[11:47:37] domas well, a) someone would have to actually fix CLDR stuff
[11:47:56] GerardM- what is wrong with it ?
[11:47:56] domas see, if proposed/draft is better than 'stable', etc
[11:47:59] domas it has errors
[11:48:10] domas I looked at random collation, and could see errors
[11:48:48] domas example: http://unicode.org/cldr/data/common/collation/lt.xml?rev=1.26&content-ty...
[11:49:36] GerardM- what is wrong with it ? and also is there a better approach to this ?
[11:49:52] GerardM- it seems to me that having a standard for this is the way to go
[11:50:03] domas well, status quo is people maintaining this separately
[11:50:07] domas we at mysql maintain it separately
[11:50:10] domas oracle maintain it separately
[11:50:10] domas etc
[11:50:14] domas CLDR is not a standard yet
[11:50:25] domas it is a good source of information, yes
[11:50:37] GerardM- you mean it is not THE standard that everyone adheres to
[11:50:44] domas it is not standard per se
[11:50:48] domas CLDR is a project, a collection
[11:50:59] domas a repository
[11:51:16] GerardM- it is treated with the same care as other standards .. eg ISO-639
[11:51:29] domas well, probably the care is similar, yes, just the amount of data there is huge
[11:51:38] domas and it is a bit sloppy
[11:51:45] domas I'm sure ISO639 has way less data
[11:51:46] domas :)
[11:51:48] domas anyway
[11:51:52] domas b)
[11:51:56] domas actual merging of this
[11:52:14] domas this includes validating if the collation is _indexable_
[11:52:32] domas see, this defines comparison rules
[11:52:50] domas to simplify
[11:52:59] domas this is possible: a>b, b>c, c>a
[11:53:09] GerardM- given that we are talking about UTF-8 collation, there should be no problem ordering them
[11:53:19] domas no, there is always problem
[11:53:30] domas say, in some collation 'll' is same as 'l'
[11:53:35] domas (thats real fact ;-)
[11:53:55] GerardM- in proper Dutch collation ij is one character
[11:53:58] domas right
[11:54:05] domas how to index that?
[11:54:48] domas I've opened quite a few bugs on issues like that in past few years :)
[11:54:52] domas in mysql
[11:54:56] GerardM- the UTF-8 solution would be that there is a separate character for the ij
[11:55:15] domas yes, but if people enter 'ij' it has to be treated in same way as that special character
[11:55:20] GerardM- there are many such issues
[11:55:22] GerardM- yes
[11:55:24] domas and how does prefixindex work for 'i%' ?
[11:55:30] domas should it include ij's or not?
[11:55:32] domas :)
[11:55:38] domas (it won't)
[11:55:41] domas oh well.
[11:55:44] domas or it may
[11:55:46] domas but it will break indexing
[11:56:00] domas c)
[11:56:04] GerardM- ok so the question is how can we have MySQL and the CLDR people work together
[11:56:25] GerardM- because it is in everyone's interest to get to a standard solution
[11:57:09] domas well, mysql supports LDMLs or whichever was the format
[11:57:14] domas here we come to c)
[11:57:18] domas and c) is
[11:57:23] domas deployment
[11:57:47] domas there're two major problems here
[11:58:19] domas during conversion to different charset one would have diverging datasets in different orders
[11:58:45] domas so, replication can cause data drift
[11:58:53] domas another is uniqueness
[11:59:00] domas e.g. the change I showed you in lt.xml
[11:59:09] domas it is changing unique constraints
[11:59:21] domas 'i' and 'y' were separate in 'standard', and are together in 'draft'
[12:00:06] domas so, there would have to be a way to resolve all the collisions in page titles
[12:00:15] domas which is no longer a technology problem, as community has to be involved
[12:00:50] GerardM- sorting and constraints are imho separate things
[12:00:57] domas you think so :))
[12:02:16] GerardM- when i and y are equal, you show them next to each other
[12:02:17] GerardM- when the constraint has it that there can only be one i or y it is a separate rule that can be implemented in a different way
[12:02:24] domas well
[12:02:31] domas you cannot have different constraints for ordering/ uniqueness
[12:02:33] GerardM- I am sure that it can be implemented in a stored procedure
[12:02:36] domas as in, different rulesets
[12:03:51] domas tertiary weights would help here, probably, but on the other hand
[12:03:56] domas 'i' and 'y' _are_ same :)
[12:04:09] domas or, well, who is authority on uniqueness then
[12:04:13] domas if we have one on sorting
[12:04:36] GerardM- the issue you raised was how to deal with existing data where i and y both exist
[12:05:06] domas well, or 'i' and 'I' as this would lead to case insensitive unique constraints
[12:05:31] GerardM- so allowing for existing garbage means you have to be able to sort. uniqueness rules would then exist for new content
[12:05:49] domas you can't
[12:05:56] domas uniqueness rules are enforced by indexing
[12:06:11] GerardM- that is one way of doing it
[12:06:25] GerardM- technically the best
[12:06:35] domas I understand the other way
[12:06:39] domas and I don't want to hear about it ;-)
[12:06:44] GerardM- <grin>
[12:07:04] domas it would probably mean converting all the indexing into non-unique
[12:07:12] domas not having any varchar PKs
[12:07:31] domas building another layer for multiple-weight comparisons
[12:07:51] domas etc
[12:07:57] domas complexity is insane
[12:08:01] domas we don't do insanely complex things :))
[12:08:09] GerardM- So the question is, how do we get to an open standard for this ?
[12:08:23] GerardM- and how do we get some proper sorting for MediaWiki ?
[12:08:24] domas well, let's say CLDR _is_ an open standard
[12:08:38] domas even then, there's huge work to migrate all that data to MW's data store
[12:08:40] domas and maintain it there
[12:08:48] domas as by a), b), c)
[12:08:50] domas let me add d)
[12:09:01] GerardM- CLDR data IS already implemented in some of the extensions !!
[12:09:03] domas amount of DBA work, with staging
[12:09:12] domas CLDR data is easy to implement
[12:09:20] domas some of CLDR data is implemented as mysql collations, or will be
[12:09:27] domas maybe all mysql collations will move to CLDR data in future
[12:09:46] domas bugs like that get opened: http://bugs.mysql.com/bug.php?id=37898
[12:10:27] GerardM- That is cool ..
[12:10:47] GerardM- so people can ask MySQL to implement particular collations ?
[12:10:48] domas and... collation changes are _incompatible_changes_ on datasets
[12:10:54] domas of course
[12:11:04] GerardM- do they get implemented ?
[12:11:08] domas takes a while
[12:11:14] domas it goes to next alpha/development release
[12:11:28] GerardM- right but that is normal in standards as well
[12:11:50] GerardM- so how long would it take .. given a positive approach
[12:12:55] domas make mediawiki support CLDR - easy
[12:12:59] domas as in, the stack
[12:13:12] domas make the stack support CLDR as a process suitable for Wikimedia - extreme pain
[12:14:25] GerardM- on the phone
[12:14:28] domas okie
[12:14:35] domas (btw, can I publish this transcript on mailing list? :)
[12:14:45] domas it is quite informative
[12:15:14] GerardM- <grin> we are not done talking and I was to ask you if I could blog about this :)
[12:15:34] domas I'm just asking in advance!
[12:15:43] domas btw, as a fun note
[12:15:59] domas CLDR does not resolve problems, where dictionaries are needed for sorting
[12:16:07] domas my favorite example, provided by River: Welsh.
[12:16:13] domas \o/
[12:17:10] domas gonna go grab some coffee too
[12:20:29] domas hold on, switching back from GPRS to DSL
[12:20:39] * Disconnected ().
[12:20:50] * Disconnected ().
[12:21:00] * Disconnected ().
[12:21:46] * domas_ sets mode -e domas_
[12:21:46] * services. sets mode +e domas
[12:21:49] domas back!
[12:45:19] GerardM- Hoi
[12:45:25] GerardM- back from the phone
[12:46:25] GerardM- given that there are over 7000 languages and that there are over 25000 linguistic entities (including dialects and orthographies) it is certainly a "growth area"
[12:48:28] GerardM- What I am looking for is that we at least support the WMF languages properly and that is already enough of a challenge
[13:04:59] domas hehe
[13:05:02] domas as I've said
[13:05:03] domas for now
[13:05:06] domas it is huge huge effort
[13:05:35] GerardM- we can do one language at a time <grin> and start with English </grin>
[13:07:40] GerardM- What you do not say is that there is an alternative to this ... so it is a job that needs doing
[13:08:11] GerardM- and imho making sure that it fits in with what is the standard makes it a best effort
[13:08:32] domas english is easiest
[13:08:35] domas kind of
[13:08:44] domas because we sort english properly even now %)
[13:09:18] domas and to have this going forward, we need multiple people to work on this project
[13:09:34] domas DBA, engine development, content conflict testing, collation evaluation, etc
[13:09:44] domas + more hardware for regular staging :)
[13:09:59] domas + rethinking of case sensitivity issues
[13:10:15] GerardM- if we would do it TOGETHER with the Unicode people, we could get funding for it
[13:10:37] domas I personally am too lazy to work on this project :))
[13:11:38] GerardM- there are other things where you can make a difference as well
[13:12:19] GerardM- I learned from Siebrand at Berlin about something that would make our servers more efficient and thereby allow for growth for a couple of months ...
[13:12:28] GerardM- had to do with loading messages
[13:12:42] domas thats for appservers
[13:12:43] domas :)
[13:12:46] domas thats my plan!
[13:12:49] domas my project!
[13:13:12] GerardM- That is certainly important
[13:13:32] GerardM- and <grin> there is certain to be much more that needs doing
[13:13:37] domas yes
[13:13:49] domas all this CLDR stuff is _very_ complex, when it comes to actual implementation
[13:14:00] domas do note, we're already way ahead of curve of others when it comes to unicode
[13:14:11] domas when I had to talk about certain charset/collation issues in a conference
[13:14:19] domas the only public implementation of unicode normalization in software was mediawiki
[13:14:20] domas :)
[13:14:30] GerardM- <grin> you know my rants about Lingala ? We sure have a long way to go !! </grin>
[13:22:03] GerardM- Would MySQL be interested in this project ?
[13:22:55] domas file a bug, and see! :)
[13:23:20] GerardM- this would be to implement CLDR collation on the Wikipedia projects
[13:23:48] GerardM- for them it would be a research project ?
[13:24:00] domas why research project
[13:24:03] domas we know all the problems already :)
[13:24:19] GerardM- the problems of implementing CLDR ?
[13:24:29] domas as I've said, MySQL supports LDML
[13:24:48] domas http://dev.mysql.com/doc/refman/5.0/en/adding-collation-unicode-uca.html
[13:25:07] domas as I've said
[13:25:09] domas CLDR is just a repo
[13:25:13] GerardM- skyoing
[13:25:15] domas LDML is spec, standard
[13:30:04] GerardM- reading about LDML to find the relation to the collation
[13:32:41] domas lol
[13:32:47] domas Each collation must have a unique ID. To add a new collation, you must choose an ID value that is not currently used. The value must be in the range from 1 to 254. The collation ID that you choose will show up in these contexts:
[13:32:48] domas hahahahaha
[13:32:51] domas thats awesome limitation
[13:33:15] domas As of MySQL 6.0.8, the range of IDs from 1024 to 2047 is reserved for user-defined collations.
[13:33:18] domas thats better
[13:33:23] GerardM- yes
[13:33:24] GerardM- indeed
[13:36:06] GerardM- so how do we currently deal with the languages from India where the order of Unicode is almost certainly to be wrong
[13:36:17] domas well, currently we're using byte order
[13:36:24] domas it is not any kind of unicode order
[13:36:35] GerardM- so there is no proper sorting
[13:36:36] domas as utf8 is variable length, offsets of character starts are different
[13:43:57] GerardM- fyi Martin Benjamin is working on the Afrigen project, a project to get CLDR data for African languages
[13:44:20] domas as I've said
[13:44:22] domas using that data statically
[13:44:28] domas on a mediawiki deployment
[13:44:30] domas is very easy
[13:44:34] domas the process is difficult
[13:45:36] GerardM- is having this deployed in different ways per database instance going to have an impact ?
[13:46:03] domas it isn't problem with an instance
[13:46:06] domas it is problem with data
[13:46:45] GerardM- right and this get you in either conversions or in inabilities to deal with them
[13:47:13] domas yup
[13:47:18] domas it is both problematic and labor-intensive work
[13:47:32] domas anyway
[13:47:38] domas switching to generic unicode sorting
[13:47:41] domas would be huge step ahead
[13:47:55] domas but we're not ready for that atm, simply because until mysql6.0, there's no unicode support outside BMP
[13:48:45] GerardM- so when would mysql6 be an option ?
[13:49:19] domas *shrug*, it is alpha now
[13:49:50] GerardM- <grin> it would be an option when you say.. I will support it </grin>
[13:50:10] domas yes, kind of
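
A small illustration of the uniqueness point from the transcript (roughly 11:58 to 12:06): if a collation change makes two letters compare as equal, titles that used to be distinct start colliding in any unique index built on that collation. The sketch below is only a toy model, not MySQL behaviour and not the real CLDR Lithuanian rules. The folding table, the key functions and the sample titles are all made up for the example.

```python
from collections import defaultdict

# Hypothetical folding table: NOT the real CLDR Lithuanian rules, only a
# stand-in for "the draft gives 'y' the same weight as 'i'".
DRAFT_EQUIVALENTS = str.maketrans({"y": "i"})


def old_key(title: str) -> str:
    """Index key under the old rules: every letter stays distinct."""
    return title


def draft_key(title: str) -> str:
    """Index key under the hypothetical draft rules: 'y' is treated as 'i'."""
    return title.translate(DRAFT_EQUIVALENTS)


def find_collisions(titles, key):
    """Return groups of titles that a unique index on `key` would reject."""
    groups = defaultdict(list)
    for title in titles:
        groups[key(title)].append(title)
    return {k: v for k, v in groups.items() if len(v) > 1}


# Made-up page titles, chosen so that two differ only in i/y.
titles = ["lina", "lyna", "vilnius", "kaunas"]

collisions_old = find_collisions(titles, old_key)
collisions_draft = find_collisions(titles, draft_key)

print("old rules:  ", collisions_old)
print("draft rules:", collisions_draft)
```

Under the old key nothing collides; under the draft key "lina" and "lyna" land on the same index entry, so one of them would have to be renamed or merged before the new collation could be deployed. That is the part that needs the community rather than a DBA.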