On 5/12/09 1:49 PM, Aryeh Gregor wrote:
On Tue, May 12, 2009 at 4:38 PM, Brion Vibber <brion(a)wikimedia.org> wrote:
* Collation use for sorting needs to be double-checked to confirm it
wouldn't interfere with present uniqueness constraints
Since cl_sortkey isn't part of any unique key, this appears not to be
an issue for this use. Of course, it's an issue for every other
sorted list of titles, but those can't have custom sort keys specified
to begin with and don't seem to be included in this proposal. Perhaps
they should be, though. In that case we'd probably end up needing an
extra column in every single table that includes the page title, just
for sorting (but we'd be able to use flexible algorithms to generate
the sort key, rather than being stuck with MySQL's).
As a general issue we also need to consider managing paging through
collation-sorted lists, since sort keys for different inputs may produce
the same result. At the moment I think category lists are paged by
offset (bad!) but we should ensure this is planned for.
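
A rough sketch of what paging by key rather than by offset could look like when several rows share the same sort key. The table layout is cut down, using cl_from as the tie-breaker is my assumption, and SQLite is only there so the example runs on its own; fetch_page and demo_paging are made-up names:

```python
import sqlite3


def fetch_page(conn, limit, after=None):
    """Return up to `limit` rows ordered by (cl_sortkey, cl_from).

    `after` is the last (cl_sortkey, cl_from) pair of the previous page,
    or None for the first page. The page ID breaks ties between rows
    whose sort keys are equal, so no row is skipped or repeated.
    """
    if after is None:
        return conn.execute(
            "SELECT cl_sortkey, cl_from FROM categorylinks "
            "ORDER BY cl_sortkey, cl_from LIMIT ?",
            (limit,),
        ).fetchall()
    key, page_id = after
    return conn.execute(
        "SELECT cl_sortkey, cl_from FROM categorylinks "
        "WHERE cl_sortkey > ? OR (cl_sortkey = ? AND cl_from > ?) "
        "ORDER BY cl_sortkey, cl_from LIMIT ?",
        (key, key, page_id, limit),
    ).fetchall()


def demo_paging():
    """Walk a tiny in-memory table two rows at a time."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_sortkey TEXT)")
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?)",
        [(1, "APPLE"), (2, "APPLE"), (3, "APPLE"), (4, "BANANA"), (5, "CHERRY")],
    )
    pages = []
    cursor = None
    while True:
        rows = fetch_page(conn, 2, cursor)
        if not rows:
            break
        pages.append(rows)
        cursor = rows[-1]  # continue after the last row seen
    return pages
```

Paging on the sort key alone would get stuck or drop rows inside the run of three "APPLE" entries; the second column in the comparison is what avoids that.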
* Multilingual sites possibly not well served by table-wide
language-specific coding
utf8 sorting would be a lot better than binary sorting for any site,
I'm pretty sure. (I assume utf8 sorts sanely and not according to
codepoint.)
Well, "utf8" doesn't tell you anything specific there... :) There's a
"general" collation as well as a "binary" one, and the binary one would
be the same as what we do now (except for not supporting 4-byte
characters AT ALL)
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
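
To get a feel for which titles the 4-byte limitation would hit, something like the following could be used. This is only a sketch; the function names are made up, and it assumes the affected characters are exactly those outside the Basic Multilingual Plane, i.e. the ones that take four bytes in UTF-8:

```python
def needs_four_bytes(text):
    """True if any character in `text` takes four bytes in UTF-8.

    Those are exactly the code points above U+FFFF.
    """
    return any(len(ch.encode("utf-8")) == 4 for ch in text)


def titles_needing_four_bytes(titles):
    """Return the titles that a 3-byte-only utf8 column could not hold."""
    return [t for t in titles if needs_four_bytes(t)]


sample = [
    "Z\u00fcrich",          # U+00FC, two bytes
    "\u6771\u4eac",         # CJK ideographs in the BMP, three bytes each
    "\U00010330\U00010331", # Gothic letters, four bytes each
]
flagged = titles_needing_four_bytes(sample)
```

Here only the last sample title is flagged.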
For a multilingual site we'd probably end up using utf8_unicode_ci,
which at least partially implements the Unicode Collation Algorithm
(UCA). That sounds kind of confusing, since even a glance at
http://www.unicode.org/reports/tr10/ makes it quite explicit that
collation properties are language-dependent... presumably that's an
un-tailored version which won't have most language-specific properties.
Doing our own localized sort key encoding and adding another indexed
column to sort on would avoid some dependency issues but has its own
deployment and maintenance difficulties.
You don't need another column for categorylinks, you can use the
existing cl_sortkey, so that should be relatively easy to deploy. It
doesn't help with non-category use cases, of course.
You would if you need to store a processed sort key index that's not in
the form of displayable characters. (eg, the output of the UCA)
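
As an illustration of a processed key that sorts well as raw bytes but isn't fit for display, here is a deliberately simplified stand-in. It is NOT the UCA: it just strips combining marks and case for a primary level and appends the original title as a tie-breaker. The function name and the NUL separator are my own assumptions:

```python
import unicodedata


def simple_sort_key(title):
    """Build a byte string whose binary order approximates a friendlier sort.

    Primary level: the title decomposed (NFKD), with combining marks
    dropped and case folded. The original title follows a NUL byte so
    that titles with equal primary keys still sort deterministically.
    """
    decomposed = unicodedata.normalize("NFKD", title)
    primary = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()
    return primary.encode("utf-8") + b"\x00" + title.encode("utf-8")


titles = ["Zebra", "\u00c9mile", "eagle", "Emile"]
by_codepoint = sorted(titles)
by_key = sorted(titles, key=simple_sort_key)
```

Sorting by code point puts "Émile" after "eagle" and "Zebra" before it; sorting by the generated key groups "Emile" and "Émile" together between "eagle" and "Zebra". The key itself contains a NUL byte, which is why it would want its own column rather than reusing one that is shown to users.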
It would also be possible to use a separate column for the collated
sorting while using MySQL 4.1+'s native collations, if the uniqueness
constraints are a problem, but this is still dependent on rolling out an
upgrade from 4.0.
In that case we may as well make it like cl_sortkey and populate it
ourselves, surely.
For the unique case of categorylinks, yes. For everything else,
additional columns are not already present.
-- brion