On Fri, May 15, 2009 at 4:22 AM, Tisza Gergő <gtisza@gmail.com> wrote:
> Would it be very expensive to have a separate (namespace, title, sortkey) table, and join on that for queries that need sorting?
You would have to scan the *entire* table you're joining from (which may be hundreds of millions of rows). Not a possibility.
On Fri, May 15, 2009 at 5:47 AM, Tisza Gergő <gtisza@gmail.com> wrote:
> Coding the first or second type of collation rule seems relatively simple, and already a huge gain. (Also, RFC 3454 might be worth checking out as it has language-independent rules for more than diacritics.)
I agree.
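To make the "relatively simple" rule concrete, here is a minimal sketch. It assumes the simple rule types mean ignoring diacritics and case, which is not spelled out in this message. The function name `simple_sortkey` is hypothetical, and the approach (NFKD decomposition, dropping combining marks, casefolding) is only a language-independent approximation, not a real collation.

```python
import unicodedata


def simple_sortkey(title: str) -> str:
    """Hypothetical sort key: decompose, drop combining marks, casefold."""
    # NFKD splits "É" into "E" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", title)
    # Combining marks have a non-zero canonical combining class.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive, language-independent lower().
    return stripped.casefold()


titles = ["Zebra", "Éclair", "eclipse", "Ångström"]
print(sorted(titles, key=simple_sortkey))
```

This puts "Ångström" next to other A titles, which is wrong for Swedish (where Å sorts after Z) but is arguably still better than raw binary ordering.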
> You can have a separate raw_sortkey column if that's a large concern.
That would still mean an UPDATE of many millions of rows. Plus you'd add another column to a table that's already very large -- categorylinks is ~40,000,000 rows on enwiki, so that's an extra 40 million varchar(255) values clogging up the buffer pool even though they'd never be used except for the occasional update.
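A back-of-the-envelope estimate of that extra column, as a rough sketch only. The row count is the one quoted above; the average key lengths and the one-byte length prefix are assumptions, not measurements, and the function `extra_column_bytes` is hypothetical. Row, page, and index overhead are ignored.

```python
def extra_column_bytes(rows: int, avg_key_bytes: int, length_prefix_bytes: int = 1) -> int:
    """Raw storage for one extra varchar column, ignoring row and page overhead."""
    return rows * (avg_key_bytes + length_prefix_bytes)


ROWS = 40_000_000  # categorylinks on enwiki, as quoted above

# Assumed average sortkey lengths in bytes; 255 is the worst case.
for avg in (20, 70, 255):
    size = extra_column_bytes(ROWS, avg)
    print(f"avg {avg:>3} bytes/key -> {size / 1e9:.2f} GB")
```

Even with short keys this is on the order of a gigabyte of data that competes for buffer pool space without serving any read query.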
> Anyway, this is the same for any solution that does not rely on MySQL collation: when the rules change, you need to update the relevant column in the database.
Correct. In fact, when MySQL's rules change you also have to rebuild the index, AFAIK.
> What are the chances that we get decent MySQL collation in the near future (say, the next few years)?
If we don't upgrade, I'd say about 0%. :) Even if we do, there are still the uniqueness problems, and the non-BMP problem. So not very good, I'd say, for our purposes. (That's not to say MySQL collation isn't decent for other purposes.)
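For anyone unfamiliar with the non-BMP problem: code points above U+FFFF need four bytes in UTF-8, while MySQL's utf8 charset in the versions of this era stores at most three bytes per character. A small sketch to show which characters fall outside that limit; the helper names are hypothetical.

```python
def is_non_bmp(ch: str) -> bool:
    """True if the code point lies outside the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF


def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# U+00E9 (é), U+20AC (€), and U+10348 (Gothic letter hwair, outside the BMP).
for ch in ("\u00e9", "\u20ac", "\U00010348"):
    print(f"U+{ord(ch):04X} non_bmp={is_non_bmp(ch)} utf8_bytes={utf8_length(ch)}")
```

Any title containing a character like the last one cannot be stored, let alone collated, in a three-byte utf8 column.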