Re: [Wikitech-l] More aggressive DEFAULTSORT

14 May 2009


      Hi!
me and Gerard had some chat about this stuff today, I'll post  
transcript if anyone is interested:
[11:47:03] 	GerardM-	Domas, could you explain to me why it would make  
a full time job to implement proper collations ? Would it not make  
sense when it is part of the standard MySQL ?
[11:47:37] 	domas	well, a) someone would have to actually fix CLDR stuff
[11:47:56] 	GerardM-	what is wrong with it ?
[11:47:56] 	domas	see, if proposed/draft is better than 'stable', etc
[11:47:59] 	domas	it has errors
[11:48:10] 	domas	I looked at random collation, and could see errors
[11:48:48] 	domas	example: http://unicode.org/cldr/data/common/collation/lt.xml?rev=1.26&content-ty...
[11:49:36] 	GerardM-	what is wrong with it ? and also is there a  
better approach to this ?
[11:49:52] 	GerardM-	it seems to me that having a standard for this is  
the way to go
[11:50:03] 	domas	well, status quo is people maintaining this separately
[11:50:07] 	domas	we at mysql maintain it separately
[11:50:10] 	domas	oracle maintain it separately
[11:50:10] 	domas	etc
[11:50:14] 	domas	CLDR is not a standard yet
[11:50:25] 	domas	it is a good source of information, yes
[11:50:37] 	GerardM-	you mean it is not THE standard that everyone  
adheres to
[11:50:44] 	domas	it is not standard per se
[11:50:48] 	domas	CLDR is a project, a collection
[11:50:59] 	domas	a repository
[11:51:16] 	GerardM-	it is treated with the same care as other  
standards .. eg ISO-639
[11:51:29] 	domas	well, probably the care is similar, yes, just the  
amount of data there is huge
[11:51:38] 	domas	and it is a bit sloppy
[11:51:45] 	domas	I'm sure ISO639 has way less data
[11:51:46] 	domas	:)
[11:51:48] 	domas	anyway
[11:51:52] 	domas	b)
[11:51:56] 	domas	actual merging of this
[11:52:14] 	domas	this includes validating if the collation is  
_indexable_
[11:52:32] 	domas	see, this defines comparison rules
[11:52:50] 	domas	to simplify
[11:52:59] 	domas	this is possible: a>b, b>c, c>a
[11:53:09] 	GerardM-	given that we are talking about UTF-8 collation,  
there should be no problem ordering them
[11:53:19] 	domas	no, there is always problem
[11:53:30] 	domas	say, in some collation 'll' is same as 'l'
[11:53:35] 	domas	(thats real fact ;-)
[11:53:55] 	GerardM-	in proper Dutch collation ij is one character
[11:53:58] 	domas	right
[11:54:05] 	domas	how to index that?
[11:54:48] 	domas	I've opened quite a few bugs on issues like that in  
past few years :)
[11:54:52] 	domas	in mysql
[11:54:56] 	GerardM-	the UTF-8 solution would be that there is a  
seperate character for the ij
[11:55:15] 	domas	yes, but if people enter 'ij' it has to be treated  
in same way as that special character
[11:55:20] 	GerardM-	there are many such issues
[11:55:22] 	GerardM-	yes
[11:55:24] 	domas	and how does prefixindex work for 'i%' ?
[11:55:30] 	domas	should it include ij's or not?
[11:55:32] 	domas	:)
[11:55:38] 	domas	(it won't)
[11:55:41] 	domas	oh well.
[11:55:44] 	domas	or it may
[11:55:46] 	domas	but it will break indexing
[11:56:00] 	domas	c)
[11:56:04] 	GerardM-	ok so the question is how can we have MySQL and  
the CLDR people work together
[11:56:25] 	GerardM-	because it is in everyone's interest to get to a  
standard solution
[11:57:09] 	domas	well, mysql supports LDMLs or whichever was the format
[11:57:14] 	domas	here we come to c)
[11:57:18] 	domas	and c) is
[11:57:23] 	domas	deployment
[11:57:47] 	domas	there're two major problems here
[11:58:19] 	domas	during conversion to different charset one would  
have diverging datasets in different orders
[11:58:45] 	domas	so, replication can cause data drift
[11:58:53] 	domas	another is uniqueness
[11:59:00] 	domas	e.g. the change I showed you in lt.xml
[11:59:09] 	domas	it is changing unique constraints
[11:59:21] 	domas	'i' and 'y' were separate in 'standard', and are  
together in 'draft'
[12:00:06] 	domas	so, there would have to be a way to resolve all the  
collisions in page titles
[12:00:15] 	domas	which is no longer a technology problem, as  
community has to be involved
[12:00:50] 	GerardM-	sorting and constraints are imho separate things
[12:00:57] 	domas	you think so :))
[12:02:16] 	GerardM-	when i and y are equal, you show them next to  
each other
[12:02:17] 	GerardM-	when the constraint has it that there can only be  
one i or y it is a separate rule that can be implemented in a  
different way
[12:02:24] 	domas	well
[12:02:31] 	domas	you cannot have different constraints for ordering/ 
uniqueness
[12:02:33] 	GerardM-	I am sure that it can be implemented in a stored  
procedure
[12:02:36] 	domas	as in, different rulesets
[12:03:51] 	domas	tertiary weights would help here, probably, but on  
the other hand
[12:03:56] 	domas	'i' and 'y' _are_ same :)
[12:04:09] 	domas	or, well, who is authority on uniqueness then
[12:04:13] 	domas	if we have one on sorting
[12:04:36] 	GerardM-	the issue you raised was how to deal with  
existing data where i and y both exist
[12:05:06] 	domas	well, or 'i' and 'I' as this would lead to case  
insensitive unique constraints
[12:05:31] 	GerardM-	so allowing for existing garbage means you have  
to be able to sort. uniqueness rules would then exist for new content
[12:05:49] 	domas	you can't
[12:05:56] 	domas	uniqueness rules are enforced by indexing
[12:06:11] 	GerardM-	that is one way of doing it
[12:06:25] 	GerardM-	technically the best
[12:06:35] 	domas	I understand the other way
[12:06:39] 	domas	and I don't want to hear about it ;-)
[12:06:44] 	GerardM-	<grin>
[12:07:04] 	domas	it would probably mean converting all the indexing  
into non-unique
[12:07:12] 	domas	not having any varchar PKs
[12:07:31] 	domas	building another layer for multiple-weight comparisons
[12:07:51] 	domas	etc
[12:07:57] 	domas	complexity is insane
[12:08:01] 	domas	we don't do insanely complex things :))
[12:08:09] 	GerardM-	So the question is, how do we get to an open  
standard for this ?
[12:08:23] 	GerardM-	and how do we get some proper sorting for  
MediaWiki ?
[12:08:24] 	domas	well, let's say CLDR _is_ an open standard
[12:08:38] 	domas	even then, there's huge work to migrate all that  
data to MW's data store
[12:08:40] 	domas	and maintain it there
[12:08:48] 	domas	as by a), b), c)
[12:08:50] 	domas	let me add d)
[12:09:01] 	GerardM-	CLDR data IS already implemented in some of the  
extensions !!
[12:09:03] 	domas	amount of DBA work, with staging
[12:09:12] 	domas	CLDR data is easy to implement
[12:09:20] 	domas	some of CLDR data is implemented as mysql  
collations, or will be
[12:09:27] 	domas	maybe all mysql collations will move to CLDR data in  
future
[12:09:46] 	domas	bugs like that get opened: http://bugs.mysql.com/bug.php?id=37898
[12:10:27] 	GerardM-	That is cool ..
[12:10:47] 	GerardM-	so people can ask MySQL to implement particular  
collations ?
[12:10:48] 	domas	and... collation changes are _incompatible_changes_  
on datasets
[12:10:54] 	domas	of course
[12:11:04] 	GerardM-	do they get implemented ?
[12:11:08] 	domas	takes a while
[12:11:14] 	domas	it goes to next alpha/development release
[12:11:28] 	GerardM-	right but that is normal in standards as well
[12:11:50] 	GerardM-	so how long would it take .. given a positive  
approach
[12:12:55] 	domas	make mediawiki support CLDR - easy
[12:12:59] 	domas	as in, the stack
[12:13:12] 	domas	make the stack support CLDR as a process suitable  
for Wikimedia - extreme pain
[12:14:25] 	GerardM-	on the phone
[12:14:28] 	domas	okie
[12:14:35] 	domas	(btw, can I publish this transcript on mailing  
list? :)
[12:14:45] 	domas	it is quite informative
[12:15:14] 	GerardM-	<grin> we are not done talking and I was to ask  
you if I could blog about this :)
[12:15:34] 	domas	I'm just asking in advance!
[12:15:43] 	domas	btw, as a fun note
[12:15:59] 	domas	CLDR does not resolve problems, where dictionaries  
are needed for sorting
[12:16:07] 	domas	my favorite example, provided by River: Welsh.
[12:16:13] 	domas	\o/
[12:17:10] 	domas	gonna go grab some coffee too
[12:20:29] 	domas	hold on, switching back from GPRS to DSL
[12:20:39] 	*	Disconnected ().
[12:20:50] 	*	Disconnected ().
[12:21:00] 	*	Disconnected ().
[12:21:46] 	*	domas_ sets mode -e domas_
[12:21:46] 	*	services. sets mode +e domas
[12:21:49] 	domas	back!
[12:45:19] 	GerardM-	Hoi
[12:45:25] 	GerardM-	back from the phone
[12:46:25] 	GerardM-	given that there are over 7000 languages and that  
there are over 25000 linguistic entities (including dialects and  
orthographies) it is certainly a "growth area
[12:48:28] 	GerardM-	What I am looking for is that we at least support  
the WMF languages properly and that is already enough of a challenge
[13:04:59] 	domas	hehe
[13:05:02] 	domas	as I've said
[13:05:03] 	domas	for now
[13:05:06] 	domas	it is huge huge effort
[13:05:35] 	GerardM-	we can do one language at a time <grin> and start  
with English </grin>
[13:07:40] 	GerardM-	What you do not say is that there is an  
alternative to this ... so it is a job that needs doing
[13:08:11] 	GerardM-	and imho making sure that it fits in with what is  
the standard makes it a best effort
[13:08:32] 	domas	english is easiest
[13:08:35] 	domas	kind of
[13:08:44] 	domas	because we sort english properly even now %)
[13:09:18] 	domas	and to have this going forward, we need multiple  
people to work on this project
[13:09:34] 	domas	DBA, engine development, content conflict testing,  
collation evaluation, etc
[13:09:44] 	domas	+ more hardware for regular staging :)
[13:09:59] 	domas	+ rethinking of case sensitivity issues
[13:10:15] 	GerardM-	if we would do it TOGETHER with the Unicode  
people, we could get funding for it
[13:10:37] 	domas	I personally am too lazy to work on this project :))
[13:11:38] 	GerardM-	there are other things where you can make a  
difference as well
[13:12:19] 	GerardM-	I learned from Siebrand at Berlin about something  
that would make our servers more efficient and thereby allow for  
growth for a couple of months ...
[13:12:28] 	GerardM-	had to do with loading messages
[13:12:42] 	domas	thats for appservers
[13:12:43] 	domas	:)
[13:12:46] 	domas	thats my plan!
[13:12:49] 	domas	my project!
[13:13:12] 	GerardM-	That is certainly important
[13:13:32] 	GerardM-	and <grin> there is certain to be much more that  
needs doing
[13:13:37] 	domas	yes
[13:13:49] 	domas	all this CLDR stuff is _very_ complex, when it comes  
to actual implementation
[13:14:00] 	domas	do note, we're already way ahead of curve of others  
when it comes to unicode
[13:14:11] 	domas	when I had to talk about certain charset/collation  
issues in a conference
[13:14:19] 	domas	the only public implementation of unicode  
normalization in software was mediawiki
[13:14:20] 	domas	:)
[13:14:30] 	GerardM-	<grin> you know my rants about Lingala ? We sure  
have a long way to go !! </grin>
[13:22:03] 	GerardM-	Would MySQL be interested in this project ?
[13:22:55] 	domas	file a bug, and see! :)
[13:23:20] 	GerardM-	this would be to implement CLDR collation on the  
Wikipedia projects
[13:23:48] 	GerardM-	for them it would be a research project ?
[13:24:00] 	domas	why research project
[13:24:03] 	domas	we know all the problems already :)
[13:24:19] 	GerardM-	the problems of implementing CLDR ?
[13:24:29] 	domas	as I've said, MySQL supports LDML
[13:24:48] 	domas	http://dev.mysql.com/doc/refman/5.0/en/adding-collation-unicode-uca.html
[13:25:07] 	domas	as I've said
[13:25:09] 	domas	CLDR is just a repo
[13:25:13] 	GerardM-	skyoing
[13:25:15] 	domas	LDML is spec, standard
[13:30:04] 	GerardM-	reading about LDML to find the relation to the  
collation
[13:32:41] 	domas	lol
[13:32:47] 	domas	Each collation must have a unique ID. To add a new  
collation, you must choose an ID value that is not currently used. The  
value must be in the range from 1 to 254. The collation ID that you  
choose will show up in these contexts:
[13:32:48] 	domas	hahahahaha
[13:32:51] 	domas	thats awesome limitation
[13:33:15] 	domas	As of MySQL 6.0.8, the range of IDs from 1024 to  
2047 is reserved for user-defined collations.
[13:33:18] 	domas	thats better
[13:33:23] 	GerardM-	yes
[13:33:24] 	GerardM-	indeed
[13:36:06] 	GerardM-	so how do we currently deal with the languages  
from India where the order of Unicode is almost certainly to be wrong
[13:36:17] 	domas	well, currently we're using byte order
[13:36:24] 	domas	it is not any kind of unicode order
[13:36:35] 	GerardM-	so there is no proper sorting
[13:36:36] 	domas	as utf8 is variable length, offsets of character  
starts are different
[13:43:57] 	GerardM-	fyi Martin Benjamin is working on the Afrigen  
project, a project to get CLDR data for African languages
[13:44:20] 	domas	as I've said
[13:44:22] 	domas	using that data statically
[13:44:28] 	domas	on a mediawiki deployment
[13:44:30] 	domas	is very easy
[13:44:34] 	domas	the process is difficult
[13:45:36] 	GerardM-	is having this deployed in different ways per  
database instance going to have an impact ?
[13:46:03] 	domas	it isn't problem with an instance
[13:46:06] 	domas	it is problem with data
[13:46:45] 	GerardM-	right and this get you in either conversions or  
in inabilties to deal with them
[13:47:13] 	domas	yup
[13:47:18] 	domas	it is both problematic and labor-intensive work
[13:47:32] 	domas	anyway
[13:47:38] 	domas	switching to generic unicode sorting
[13:47:41] 	domas	would be huge step ahead
[13:47:55] 	domas	but we're not ready for that atm, simpy because  
until mysql6.0, there's no unicode support outside BMP
[13:48:45] 	GerardM-	so when would mysql6 be an option ?
[13:49:19] 	domas	*shrug*, it is alpha now
[13:49:50] 	GerardM-	<grin> it would be an option when you say.. I  
will support it </grin>
[13:50:10] 	domas	yes, kind of

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] More aggressive DEFAULTSORT