On Tue, May 12, 2009 at 5:46 PM, Brion Vibber brion@wikimedia.org wrote:
As a general issue we also need to consider managing paging through collation-sorted lists, since sort keys for different inputs may produce the same result. At the moment I think category lists are paged by offset (bad!) but we should ensure this is planned for.
Category lists use Pager, so they're paged by index offsets, not LIMIT M, N. Note that they should probably be ordered by (cl_sortkey, cl_from) or something instead of just (cl_sortkey) -- currently, rows with equal sortkeys can be skipped or repeated across a page boundary. But Pager doesn't support multi-key sort right now.
I'm not sure what you mean here, though. What does "sort keys for different inputs may produce the same result" mean? You're just talking about sort key conflicts? In that case it seems best to just disambiguate by whatever's handy, in this case cl_from (which is the page_id and so not very meaningful). If it's coming up often enough to be a problem, the sort keys should be improved!
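To make the tie-breaking idea concrete, here's a rough sketch in Python with sqlite3. This is not Pager's actual code: the table is a cut-down stand-in for categorylinks restricted to a single category, and make_db and fetch_page are made-up helpers. The point is only that continuing from the last (cl_sortkey, cl_from) pair gives a stable order even when sortkeys collide.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory stand-in for categorylinks (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # Assumption: one category only, so cl_from (the page_id) is unique here.
    conn.execute(
        "CREATE TABLE categorylinks ("
        "cl_from INTEGER PRIMARY KEY, cl_sortkey TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX cl_sort ON categorylinks (cl_sortkey, cl_from)")
    rows = [(1, "Apple"), (2, "Banana"), (3, "Banana"), (4, "Banana"), (5, "Cherry")]
    conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", rows)
    return conn


def fetch_page(conn, limit, after=None):
    """Return up to `limit` (cl_sortkey, cl_from) rows following `after`.

    `after` is the last (cl_sortkey, cl_from) pair of the previous page,
    or None for the first page.
    """
    if after is None:
        cur = conn.execute(
            "SELECT cl_sortkey, cl_from FROM categorylinks "
            "ORDER BY cl_sortkey, cl_from LIMIT ?",
            (limit,),
        )
    else:
        sortkey, page_id = after
        # Equal sortkeys are disambiguated by cl_from, so no row is
        # skipped or repeated at the page boundary.
        cur = conn.execute(
            "SELECT cl_sortkey, cl_from FROM categorylinks "
            "WHERE cl_sortkey > ? OR (cl_sortkey = ? AND cl_from > ?) "
            "ORDER BY cl_sortkey, cl_from LIMIT ?",
            (sortkey, sortkey, page_id, limit),
        )
    return cur.fetchall()
```

With a page size of 2, the first page ends in the middle of the run of "Banana" rows, and passing that last pair back as `after` picks up the remaining two. Continuing on cl_sortkey alone would either drop them (with >) or repeat a row (with >=).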
You don't need another column for categorylinks, you can use the existing cl_sortkey, so that should be relatively easy to deploy. It doesn't help with non-category use cases, of course.
You would if you need to store a processed sort key index that's not in the form of displayable characters (e.g., the output of the UCA).
Why? cl_sortkey isn't ever displayed to the user, so I don't see why it couldn't contain binary characters. I guess it's in the URL of links past the first page, but that's not a huge deal. Although it is a definite downside I didn't think of (it's nice to have manually-editable URLs!).
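For what the URL issue looks like in practice, here's a small Python sketch, assuming the sort key is an opaque byte string and the paging link carries it percent-encoded. The helper names are made up for illustration and the sample bytes are not real UCA output.

```python
from urllib.parse import quote, unquote_to_bytes


def sortkey_to_url_param(sortkey: bytes) -> str:
    """Percent-encode a binary sort key for use in a paging link."""
    # safe="" so that reserved characters such as "/" are escaped too.
    return quote(sortkey, safe="")


def url_param_to_sortkey(param: str) -> bytes:
    """Recover the original bytes from the percent-encoded parameter."""
    return unquote_to_bytes(param)
```

A plain-text key like b"Banana" comes out unchanged, so the URL stays editable by hand. A binary key turns into a run of %XX escapes, which still round-trips but is no longer something a person would type.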
It would also be possible to use a separate column for the collated sorting while using MySQL 4.1+'s native collations, if the uniqueness constraints are a problem, but this is still dependent on rolling out an upgrade from 4.0.
In that case we may as well make it like cl_sortkey and populate it ourselves, surely.
For the unique case of categorylinks, yes. For everything else, additional columns are not already present.
I was saying that if we were going to make extra columns, we may as well roll our own sort keys instead of bothering with collations, since it's not like we'd save a column. But of course if rolling our own would mean two extra columns instead of one, that would be a definite downside. Still, MySQL's collation support is unlikely to ever extend to nearly as many languages as we support, and it can't handle niceties like eliding initial "A" or "The" in English, say. So it doesn't seem like as good a solution.
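As a rough illustration of what rolling our own could look like, here's a Python sketch of a sort key generator that elides an initial "A" or "The". It's English-only and deliberately naive; make_sortkey and the article list are made up for the example, and real keys would presumably come from something like the UCA rather than simple case folding.

```python
# Hypothetical list of leading articles to elide (English only).
LEADING_ARTICLES = ("a ", "the ")


def make_sortkey(title: str) -> str:
    """Return a case-folded sort key with an initial "A" or "The" dropped."""
    folded = title.casefold()
    for article in LEADING_ARTICLES:
        # Only strip when something is left over after the article.
        if folded.startswith(article) and len(folded) > len(article):
            return folded[len(article):]
    return folded


titles = ["The Beatles", "Aardvark", "A Clockwork Orange", "Zebra"]
ordered = sorted(titles, key=make_sortkey)
# ordered == ["Aardvark", "The Beatles", "A Clockwork Orange", "Zebra"]
```

The computed key would be stored in the sort column and the original title displayed, which is the part a database-level collation can't do for us.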