On Fri, May 15, 2009 at 4:22 AM, Tisza Gergő <gtisza(a)gmail.com> wrote:
> Would it be very expensive to have a separate (namespace, title,
> sortkey) table, and join on that for queries that need sorting?
You would have to scan the *entire* table you're joining from (which
may be hundreds of millions of rows). Not a possibility.
On Fri, May 15, 2009 at 5:47 AM, Tisza Gergő <gtisza(a)gmail.com> wrote:
> Coding the first or second type of collation rule seems relatively
> simple, and already a huge gain. (Also, RFC 3454 might be worth
> checking out as it has language-independent rules for more than
> diacritics.)
I agree.
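To make the simpler kind of rule concrete (ignoring case and diacritics), here is a minimal Python sketch. The name `make_sortkey` is hypothetical and is not MediaWiki code; it uses only the standard `unicodedata` module and is nowhere near a full collation algorithm or an implementation of RFC 3454.

```python
import unicodedata

def make_sortkey(title: str) -> str:
    """Hypothetical helper: build a sort key that ignores case and diacritics."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive, Unicode-aware lower().
    return stripped.casefold()

titles = ["Zebra", "Éclair", "apple", "Ångström", "eclipse"]
ordered = sorted(titles, key=make_sortkey)
print(ordered)
```

Sorting on such a key puts "Éclair" next to "eclipse" instead of after "Zebra", which is roughly the gain being discussed; language-specific rules would need more than this.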
> You can have a separate raw_sortkey column if that's a large concern.
That would still mean an UPDATE of many millions of rows. Plus you'd
add another column to a table that's already very large --
categorylinks is ~40,000,000 rows on enwiki, and that's an extra 40m
varchar(255)s clogging up the buffer pool even though they're never
going to be used except for the occasional update.
> Anyway, this is the same for any solution that does not rely on MySQL
> collation: when the rules change, you need to update the relevant
> column in the database.
Correct. In fact, when MySQL's rules change you also have to rebuild
the index, AFAIK.
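A rough sketch of what "update the relevant column" means in practice, assuming the sort key is computed in application code. SQLite stands in for MySQL here, and the `links` table, `new_sortkey`, and `rebuild_sortkeys` are all hypothetical names for illustration, not the real schema or any existing maintenance script.

```python
import sqlite3
import unicodedata

def new_sortkey(raw: str) -> str:
    # Hypothetical "new rule": ignore case and diacritics.
    decomposed = unicodedata.normalize("NFKD", raw)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def rebuild_sortkeys(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    """Recompute every sort key in primary-key order, committing once per batch."""
    last_id = 0
    updated = 0
    while True:
        rows = conn.execute(
            "SELECT id, title FROM links WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return updated
        conn.executemany(
            "UPDATE links SET sortkey = ? WHERE id = ?",
            [(new_sortkey(title), row_id) for row_id, title in rows],
        )
        conn.commit()  # keep each transaction small
        last_id = rows[-1][0]
        updated += len(rows)

# Tiny in-memory stand-in for a very large table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (id INTEGER PRIMARY KEY, title TEXT, sortkey TEXT)")
conn.executemany(
    "INSERT INTO links (title, sortkey) VALUES (?, ?)",
    [(t, t) for t in ["Éclair", "apple", "Zebra", "Ångström", "eclipse"]],
)
conn.commit()

total = rebuild_sortkeys(conn)
result = [r[0] for r in conn.execute("SELECT title FROM links ORDER BY sortkey")]
print(total, result)
```

Batching keeps individual transactions short, but the total work is still one write per row, so the cost scales with the size of the table however the update is sliced.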
> What are the chances that we get decent MySQL collation in the near
> future (say, next few years)?
If we don't upgrade, I'd say about 0%. :) Even if we do, there are
still the uniqueness problems, and the non-BMP problem. So not very
good, I'd say, for our purposes. (That's not to say MySQL collation
isn't decent for other purposes.)