On 22/07/10 07:00, Aryeh Gregor wrote:
Categories, files, and other types of pages cannot be moved to one another, as far as I know (it would hardly make sense), so it automatically stays consistent this way.
This restriction is enforced by Title::isValidMoveOperation().
- Change the way category sortkeys are generated. Start them with a letter depending on namespace, like 'C' for category, 'P' for regular page, 'F' for file. After that first letter, append a sortkey generated by ICU or whatever.
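For concreteness, the prefixing idea could be sketched roughly as below. This is only an illustration in Python, not MediaWiki code: TYPE_PREFIX, make_sortkey and fake_collation_key are hypothetical names, and the casefolded UTF-8 bytes merely stand in for whatever ICU (or another collator) would produce.

```python
# Hypothetical one-letter prefixes, following the letters named in the proposal.
TYPE_PREFIX = {"category": b"C", "page": b"P", "file": b"F"}


def fake_collation_key(title: str) -> bytes:
    """Stand-in for a real collation key (assumption: ICU would go here)."""
    return title.casefold().encode("utf-8")


def make_sortkey(page_type: str, collation_key: bytes) -> bytes:
    """Prepend the type letter so each page type sorts as one contiguous block."""
    return TYPE_PREFIX[page_type] + collation_key


keys = sorted(
    [
        make_sortkey("page", fake_collation_key("Apple")),
        make_sortkey("category", fake_collation_key("Zebra")),
        make_sortkey("file", fake_collation_key("Mango.jpg")),
    ]
)
```

With plain byte ordering the three types come out grouped by prefix letter (C, then F, then P), regardless of what the rest of the key contains.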
An alternative would be to add a column to the categorylinks table, say cl_type. It could be an ENUM or some short text type. Then the index could be altered to include this field at the start of it.
Presumably the rationale for combining these two things into cl_sortkey is to avoid a schema change, and to make the paging code slightly simpler. But I worry that future generations of MediaWiki developers will curse us for laziness and obfuscation.
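A rough sketch of the separate-column alternative, using Python's sqlite3 module purely for illustration. The assumptions here are mine: SQLite has no ENUM, so a TEXT column with a CHECK constraint stands in for it; the values 'subcat', 'page' and 'file' and the index name are hypothetical; and the table is cut down to the columns that matter for the example.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a cut-down categorylinks table with a hypothetical cl_type column."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE categorylinks (
            cl_from    INTEGER NOT NULL,
            cl_to      TEXT    NOT NULL,
            cl_sortkey BLOB    NOT NULL,
            -- Stand-in for an ENUM; the three values are assumptions.
            cl_type    TEXT    NOT NULL
                       CHECK (cl_type IN ('subcat', 'page', 'file'))
        );
        -- The type field goes at the start of the index, as suggested above.
        CREATE INDEX cl_type_sortkey
            ON categorylinks (cl_type, cl_to, cl_sortkey, cl_from);
        """
    )
    return conn


def members(conn: sqlite3.Connection, category: str, cl_type: str) -> list:
    """Return the page IDs of one type in one category, in sortkey order."""
    rows = conn.execute(
        "SELECT cl_from FROM categorylinks "
        "WHERE cl_type = ? AND cl_to = ? "
        "ORDER BY cl_sortkey, cl_from",
        (cl_type, category),
    )
    return [row[0] for row in rows]
```

The sort key itself stays free of any type marker, so the paging code has to filter on cl_type explicitly, which is the extra complexity being traded against the schema change.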
I think Tim has opinions on what would be a good choice to convert the article title into a sort key -- if not, I'll have to research it and hopefully not come up with a completely incorrect answer.
Well, I've said ICU, possibly with a PHP simulation of some Western European sort key algorithm for the benefit of users without access to ICU. But I formed that opinion years ago, and I never properly surveyed all the possible solutions in the first place. It probably makes sense to do a little of your own research.
Note that I specifically excluded the actual implementation of language-dependent sort keys from the requirements list when I wrote up this project. It could easily eat up a lot of time, and it's not necessary for a proof-of-principle implementation.
- On category pages, maintain three offsets and do three queries (or maybe UNION them together, doesn't matter), one for each of categories/regular pages/files. Because of (1), this will be efficient and will also sort less unreasonably for non-English languages.
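The three-offset paging could be sketched like this, with a sorted Python list standing in for the database index on prefixed sort keys. The names fetch_section and category_page are hypothetical, and the sketch assumes single-byte ASCII prefixes ('C', 'P', 'F') as in the first proposal.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Optional


def fetch_section(
    index: List[bytes], prefix: bytes, after: Optional[bytes], limit: int
) -> List[bytes]:
    """Return up to `limit` keys with the given prefix, after the given offset.

    `index` must be sorted; `after` is the unprefixed key of the last item
    already shown, or None to start from the beginning of the section.
    """
    if after is None:
        lo = bisect_left(index, prefix)
    else:
        lo = bisect_right(index, prefix + after)
    # Upper bound: the next letter after the prefix (b"C" -> b"D").
    hi = bisect_left(index, bytes([prefix[0] + 1]))
    return index[lo:min(lo + limit, hi)]


def category_page(
    index: List[bytes], offsets: Dict[str, bytes], limit: int
) -> Dict[str, List[bytes]]:
    """Run one range query per page type, each with its own offset."""
    return {
        letter: fetch_section(index, letter.encode("ascii"), offsets.get(letter), limit)
        for letter in ("C", "P", "F")
    }


index = sorted([b"Cbirds", b"Ccats", b"Fcat.jpg", b"Palpha", b"Pbeta", b"Pgamma"])
first_page = category_page(index, {}, 2)
next_pages = category_page(index, {"P": b"beta"}, 2)
```

Each section is a single contiguous range of the index, so every one of the three queries is a plain range scan with a limit.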
One problem that was pointed out somewhere in the massive useless discussion on bug 164 is that we'd have to do something to display the first letter for each section. Currently it's just the first letter of the sortkey, but if that's some binary string, that becomes a problem. I'm not seeing an obvious solution, since the sortkey-generation algorithm will be opaque to us. If it sorts Á the same as A, then how do we figure out that the "canonical" first letter for the section should be "A" and not "Á"? How do we even figure out where the sections begin or end? Would that even make sense in all cases? At a first pass, I'd say we should just skip the first letter and display all the items straight from beginning to end without section divisions. I don't think that's a big problem.
Roan is also asking for a store of the plain text form in this thread.
Work out how much space we would need to additionally store the category keys in plain text. Then we will know what sort of tradeoff we are looking at. Have you got a toolserver account you can use to do the sums?
Since we won't be sorting on the plain text form anymore, we could use some tricks to save space. For instance, if the sort key is the same as the article title, we could store NULL instead of another copy of the article title. That should save 95% or so.
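The NULL trick amounts to something like the following sketch (Python, with None playing the role of SQL NULL; encode_plain_key and decode_plain_key are hypothetical names, not existing functions):

```python
from typing import Optional


def encode_plain_key(title: str, plain_sortkey: str) -> Optional[str]:
    """Store nothing (NULL) when the plain sort key is just the article title."""
    return None if plain_sortkey == title else plain_sortkey


def decode_plain_key(title: str, stored: Optional[str]) -> str:
    """Recover the plain sort key, falling back to the title for NULL."""
    return title if stored is None else stored
```

Only pages with a custom sort key cost any extra space; everything else round-trips through the title that is already stored elsewhere.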
-- Tim Starling