2010/7/22 Tim Starling tstarling@wikimedia.org:
An alternative would be to add a column to the categorylinks table, say cl_type. It could be an ENUM or some short text type. Then the index could be altered to include this field at the start of it.
Presumably the rationale for combining these two things into cl_sortkey is to avoid a schema change, and to make the paging code slightly simpler. But I worry that future generations of MediaWiki developers will curse us for laziness and obfuscation.
There is another reason to prefer this schema, which is that the orginially proposed one is susceptible to weird transition bugs. After this feature is deployed, there will be old-format (i.e. plain) sortkeys sticking around in the database for quite some time after (they won't go away until LinksUpdate fixes them), which means that pages whose sortkey starts with a C or F will be recognized as categories and files respectively, even if they're normal pages.
The best way to mitigate that is to populate the namespace information prior to deployment. In Tim's schema, that means filling the cl_type field based on page_namespace. In the sortkey prefix schema, that means prefixing the sortkey with the relevant sortkey, but that also requires the sortkey updating code has already been updated at that time (so it doesn't overwrite new-style sortkeys with old-style ones), which means you'd have to partially deploy the code while running the population script. Yuck.
Roan is also asking for a store of the plain text form in this thread.
I have since conceded that it takes space for almost zero gain (as it doesn't solve the first letter problem and doesn't have much gain for the API either), see my previous post.
Roan Kattouw (Catrope)