On 5/12/09 1:49 PM, Aryeh Gregor wrote:
On Tue, May 12, 2009 at 4:38 PM, Brion Vibber <brion@wikimedia.org> wrote:
- Collation use for sorting needs to be double-checked to confirm it
wouldn't interfere with present uniqueness constraints
Since cl_sortkey isn't part of any unique key, this appears not to be an issue for this use. Of course, it's an issue for every other sorted list of titles, but those can't have custom sort keys specified to begin with and don't seem to be included in this proposal. Perhaps they should be, though. In that case we'd probably end up needing an extra column in every single table that includes the page title, just for sorting (but we'd be able to use flexible algorithms to generate the sort key, rather than being stuck with MySQL's).
As a general issue we also need to consider managing paging through collation-sorted lists, since sort keys for different inputs may produce the same result. At the moment I think category lists are paged by offset (bad!) but we should ensure this is planned for.
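A rough sketch of what paging by key rather than by offset could look like, with the page ID as a tie-breaker so that rows sharing a sort key are neither skipped nor repeated. This is only an illustration: it uses an in-memory SQLite table, the table name categorylinks_demo and the helpers make_db, fetch_page and iterate_all are made up for the example, and it assumes a single category so that cl_from alone is unique.

```python
import sqlite3

# Hypothetical demo table: one category only, so cl_from is unique here.
def make_db(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE categorylinks_demo ("
        " cl_from INTEGER PRIMARY KEY,"
        " cl_sortkey TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO categorylinks_demo VALUES (?, ?)", rows)
    return conn

def fetch_page(conn, limit, after=None):
    """Return up to `limit` (cl_sortkey, cl_from) rows following `after`.

    `after` is the last (cl_sortkey, cl_from) pair of the previous page,
    or None for the first page.
    """
    if after is None:
        cur = conn.execute(
            "SELECT cl_sortkey, cl_from FROM categorylinks_demo"
            " ORDER BY cl_sortkey, cl_from LIMIT ?",
            (limit,),
        )
    else:
        key, page_id = after
        # Continue strictly after (key, page_id): later keys, or the same
        # key with a larger page ID. Equal sort keys are therefore safe.
        cur = conn.execute(
            "SELECT cl_sortkey, cl_from FROM categorylinks_demo"
            " WHERE cl_sortkey > ? OR (cl_sortkey = ? AND cl_from > ?)"
            " ORDER BY cl_sortkey, cl_from LIMIT ?",
            (key, key, page_id, limit),
        )
    return cur.fetchall()

def iterate_all(conn, limit):
    """Walk every page and return the concatenated rows."""
    rows, after = [], None
    while True:
        page = fetch_page(conn, limit, after)
        if not page:
            return rows
        rows.extend(page)
        after = page[-1]
```

With offset paging, a row inserted or deleted ahead of the current position shifts everything after it; with the (sort key, page ID) cursor the next page always starts from a well-defined point, even when many rows share one sort key.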
- Multilingual sites possibly not well served by table-wide
language-specific coding
utf8 sorting would be a lot better than binary sorting for any site, I'm pretty sure. (I assume utf8 sorts sanely and not according to codepoint.)
Well, "utf8" doesn't tell you anything specific there... :) There's a "general" collation as well as a "binary" one, which would be the same as what we do now (except for not supporting 4-byte characters AT ALL)
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
For a multilingual site we'd probably end up using utf8_unicode_ci, which at least partially implements the Unicode Collation Algorithm (UCA). That sounds kind of confusing, since even a glance at http://www.unicode.org/reports/tr10/ makes it quite explicit that collation properties are language-dependent... presumably it's an un-tailored version which won't have most language-specific properties.
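To make the 4-byte caveat above concrete: MySQL's utf8 charset stores at most three bytes per character, so anything outside the Basic Multilingual Plane doesn't fit. A small check along these lines would flag affected titles; the helper names utf8_width and fits_three_byte_utf8 are invented for this sketch and are not anything in MediaWiki.

```python
def utf8_width(ch):
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

def fits_three_byte_utf8(text):
    """True if every character needs at most three UTF-8 bytes.

    Characters above U+FFFF need four bytes and so would not fit in a
    column limited to three bytes per character.
    """
    return all(utf8_width(ch) <= 3 for ch in text)

# Examples (escapes used so the source stays ASCII):
#   fits_three_byte_utf8("Z\u00fcrich")   -> True   (u-umlaut is 2 bytes)
#   fits_three_byte_utf8("\U00010348")    -> False  (Gothic letter, 4 bytes)
```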
Doing our own localized sort key encoding and adding another indexed column to sort on would avoid some dependency issues but has its own deployment and maintenance difficulties.
You don't need another column for categorylinks, you can use the existing cl_sortkey, so that should be relatively easy to deploy. It doesn't help with non-category use cases, of course.
You would if you need to store a processed sort key index that's not in the form of displayable characters. (e.g., the output of the UCA)
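As an illustration of a processed key living apart from the displayable text, here is a toy stand-in. It is emphatically not the UCA: it just decomposes the title, drops combining marks, case-folds, and returns bytes. The name toy_sort_key is made up for the example, and a real implementation would presumably emit UCA weight bytes instead.

```python
import unicodedata

def toy_sort_key(title):
    """Build a crude binary sort key for `title` (NOT a UCA key).

    Steps: compatibility-decompose, strip combining marks (accents),
    case-fold, then encode to bytes for storage in a binary column.
    """
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold().encode("utf-8")

titles = ["Zebra", "\u00e9mile", "Emile", "apple"]

# Plain codepoint order puts uppercase first and the accented title last.
by_codepoint = sorted(titles)

# Sorting on the processed key groups the accented and unaccented forms
# together; both map to the same key, so their relative order is only
# preserved because Python's sort is stable.
by_key = sorted(titles, key=toy_sort_key)
```

Note that "\u00e9mile" and "Emile" collapse to the same key here, which is exactly the situation where paging needs a tie-breaker in addition to the sort key.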
It would also be possible to use a separate column for the collated sorting while using MySQL 4.1+'s native collations, if the uniqueness constraints are a problem, but this is still dependent on rolling out an upgrade from 4.0.
In that case we may as well make it like cl_sortkey and populate it ourselves, surely.
For the unique case of categorylinks yes. For everything else, additional columns are not already present.
-- brion