2010/7/21 Aryeh Gregor <Simetrical+wikilist(a)gmail.com>:
Categories, files, and other types of pages cannot
be moved to one another, as far as I know (it would hardly make
sense), so it automatically stays consistent this way.
This is true for categories but not for files:
http://www.mediawiki.org/w/index.php?title=Special:Log&dir=prev&off…
So I propose that we keep the current category/normal page/file split,
and paginate those three parts of the page separately. So you'd have
up to 200 subcategories, then below that up to 200 normal pages, then
below that up to 200 files. (The numbers could be adjusted.
Currently they're hardcoded, which is stupid.) Paginating
subcategories separately is obviously needed. Paginating files
separately is not really needed, but it would be much more consistent.
Sounds good to me. I do think we will want to page all three of these
things separately; it'd be stupidly inconsistent not to do that.
The overall solution, then, would be:
1) Change the way category sortkeys are generated. Start them with a
letter depending on namespace, like 'C' for category, 'P' for regular
page, 'F' for file. After that first letter, append a sortkey
generated by ICU or whatever. I think Tim has opinions on what would
be a good choice to convert the article title into sort key -- if not,
I'll have to research it and hopefully not come up with a completely
incorrect answer.
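A minimal Python sketch of the prefixing idea in 1), under loud
assumptions: the names NS_PREFIX, fold_title and make_sortkey are
hypothetical, and the accent-stripping fold is only a stand-in for
whatever ICU-style collation key ends up being chosen.

```python
import unicodedata

# Hypothetical namespace prefixes, following the 'C'/'P'/'F' idea above.
NS_PREFIX = {"category": "C", "page": "P", "file": "F"}


def fold_title(title: str) -> str:
    """Stand-in for a real collation key: strip accents, then casefold.

    This is NOT what ICU produces; it only imitates the property that
    letters such as 'A' and 'Á' end up with the same key.
    """
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def make_sortkey(namespace: str, title: str) -> str:
    """Prefix the folded title with a letter that depends on the namespace."""
    return NS_PREFIX[namespace] + fold_title(title)
```

The prefix only partitions the keys by type. Sorted as plain strings,
'F' keys would come before 'P' keys, so the order in which the three
groups are displayed still has to be decided by the code that queries
them.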
Note that different languages will want different orders. For
instance, German generally sorts ä as ae, ö as oe and ü as ue, whereas
the Swedish sort å, ä and ö at the end of the alphabet (so they
actually say A, B, C, ... Z, Å, Ä, Ö and use the phrase "from A to
Ö"). These collation schemes obviously conflict in their handling of ä
and ö, and I'm sure there's crazier stuff out there.
This could be solved by having a different collation scheme for each
content language (these have to be standardized *somewhere*, right?)
and using {{DEFAULTSORT:}} for those rare cases where you have an
article about a German person on a non-German wiki and want it to sort
the German way.
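To make the conflict concrete, here is a rough Python sketch with one
key function per content language. Everything in it (german_key,
swedish_key, COLLATORS, sort_titles) is hypothetical and hand-rolled
for illustration; a real implementation would take its rules from a
standardized collation source rather than from tables like these.

```python
import unicodedata

# Swedish alphabet order as described above: å, ä and ö come after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def german_key(title):
    """Expand umlauts the way described above: ä as ae, ö as oe, ü as ue."""
    text = unicodedata.normalize("NFC", title).casefold()
    for umlaut, expansion in (("ä", "ae"), ("ö", "oe"), ("ü", "ue")):
        text = text.replace(umlaut, expansion)
    return text


def swedish_key(title):
    """Rank each letter by its position in the Swedish alphabet.

    Characters outside the alphabet sort before all letters (a simplification).
    """
    text = unicodedata.normalize("NFC", title).casefold()
    return [
        (1, SWEDISH_ALPHABET.index(ch)) if ch in SWEDISH_ALPHABET else (0, ord(ch))
        for ch in text
    ]


# Hypothetical mapping from content language code to key function.
COLLATORS = {"de": german_key, "sv": swedish_key}


def sort_titles(titles, content_language):
    """Sort titles with the collation scheme of the wiki's content language."""
    return sorted(titles, key=COLLATORS[content_language])
```

With the titles "Zahn", "Ärger" and "Apfel", the "de" scheme puts
"Ärger" first (it is compared as "aerger"), while the "sv" scheme puts
it last, after "Zahn".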
2) On category pages, maintain three offsets and do three queries (or
maybe UNION them together, doesn't matter),
In my personal opinion, UNION makes zero sense because you'd have to
pull the data apart again after querying it, as you're displaying it
separately as well. Separate queries are much cleaner in this case.
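A sketch of the separate-queries variant in Python with sqlite3, only
to show the shape of it. The "links" table and its columns are a
made-up simplification, not the real categorylinks schema, and the
kind labels and function names are hypothetical. It also assumes
sortkeys are unique within a section; otherwise paging on the sortkey
alone would need a tie-breaker.

```python
import sqlite3

PAGE_LIMIT = 200  # the limit discussed above; a parameter here, not hardcoded

KINDS = ("subcat", "page", "file")  # hypothetical labels for the three sections


def make_demo_db(rows):
    """Build an in-memory table of (category, kind, sortkey, title) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE links (category TEXT, kind TEXT, sortkey TEXT, title TEXT)"
    )
    conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?)", rows)
    return conn


def fetch_section(conn, category, kind, after_key="", limit=PAGE_LIMIT):
    """Return (rows, next_offset) for one section, paging by sortkey."""
    rows = conn.execute(
        "SELECT sortkey, title FROM links"
        " WHERE category = ? AND kind = ? AND sortkey > ?"
        " ORDER BY sortkey LIMIT ?",
        (category, kind, after_key, limit + 1),  # one extra row detects a next page
    ).fetchall()
    page = rows[:limit]
    next_offset = page[-1][0] if len(rows) > limit else None
    return page, next_offset


def fetch_category_page(conn, category, offsets, limit=PAGE_LIMIT):
    """Run one query per section, each with its own offset."""
    return {
        kind: fetch_section(conn, category, kind, offsets.get(kind, ""), limit)
        for kind in KINDS
    }
```

Each section comes back already separated, with its own "next" offset,
so nothing has to be pulled apart after querying.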
One problem that was pointed out somewhere in the massive useless
discussion on bug 164 is that we'd have to do something to display the
first letter for each section. Currently it's just the first letter
of the sortkey, but if that's some binary string, that becomes a
problem. I'm not seeing an obvious solution, since the
sortkey-generation algorithm will be opaque to us. If it sorts Á the
same as A, then how do we figure out that the "canonical" first letter
for the section should be "A" and not "Á"? How do we even figure out
where the sections begin or end? Would that even make sense in all
cases? At a first pass, I'd say we should just skip the first letter
and display all the items straight from beginning to end without
section divisions. I don't think that's a big problem.
I agree that the first-letter thing is a nice-to-have, but I'm more
worried about the general problem that sortkeys won't be
human-readable strings anymore (the API currently displays them and,
obviously, uses them for paging) nor possible to decode into
human-readable strings (because the encoding essentially loses
information when e.g. a and á are folded). It would be nice if we
could store the original, unmunged sortkey in the categorylinks table,
although I realize that would eat space for display and debugging
purposes only.
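To illustrate what storing both values could look like, here is a
small Python sketch. The CategoryLink record, the fold function (again
an accent-stripping stand-in, not a real collation key) and the way a
section heading is derived from the raw sortkey are all assumptions
made for the example, not a proposal for the actual schema.

```python
import unicodedata
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class CategoryLink:
    collation_key: str  # opaque key, used only for ordering and paging
    raw_sortkey: str    # original, human-readable sortkey kept alongside


def fold(text):
    """Stand-in for an opaque collation key: strip accents, then casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()


def make_link(raw_sortkey):
    """Store the folded key for sorting and the raw key for display."""
    return CategoryLink(fold(raw_sortkey), raw_sortkey)


def heading(link):
    """One possible section heading: the folded first letter, uppercased."""
    return fold(link.raw_sortkey[:1]).upper()


def sections(links):
    """Order by the opaque key, then group by a heading taken from the raw key."""
    ordered = sorted(links, key=lambda link: link.collation_key)
    return [
        (letter, [link.raw_sortkey for link in group])
        for letter, group in groupby(ordered, key=heading)
    ]
```

Because the raw sortkey survives, it can be shown to users and in the
API even though the key used for ordering has lost the distinction
between a and á.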
Roan Kattouw (Catrope)