On 21 July 2010 14:49, Roan Kattouw roan.kattouw@gmail.com wrote:
2010/7/21 Aryeh Gregor Simetrical+wikilist@gmail.com:
Note that different languages will want different orders. For instance, German generally sorts ä as ae, ö as oe and ü as ue, whereas the Swedish sort å, ä and ö at the end of the alphabet (so they actually say A, B, C, ... Z, Å, Ä, Ö and use the phrase "from A to Ö"). These collation schemes obviously conflict in their handling of ä and ö, and I'm sure there's crazier stuff out there.
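To make the conflict concrete, here is a rough sketch of the two orderings. It is a hand-rolled approximation, not ICU or any real collation library: `german_key` and `swedish_key` are hypothetical helpers, and the German rule shown is only the ae/oe/ue expansion described above.

```python
# Hypothetical, simplified sort keys; a real implementation would use
# a proper collation library rather than these tables.
GERMAN_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def german_key(word):
    """Expand umlauts so that ä sorts as ae, ö as oe and ü as ue."""
    return "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in word.lower())


def swedish_key(word):
    """Rank letters by the Swedish alphabet, with å, ä, ö after z."""
    # Characters outside the alphabet sort after everything else.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_RANK)) for ch in word.lower()]


words = ["Zebra", "Öl", "Apfel", "Ofen", "Äpfel"]
german_order = sorted(words, key=german_key)
swedish_order = sorted(words, key=swedish_key)
```

With this input the German key gives Äpfel, Apfel, Öl, Ofen, Zebra, while the Swedish key gives Apfel, Ofen, Zebra, Äpfel, Öl, so the same list cannot satisfy both.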
This could be solved by having a different collation scheme for each content language (these have to be standardized *somewhere*, right?) and using {{DEFAULTSORT:}} for those rare cases where you have an article about a German person on a non-German wiki and want it to sort the German way.
For Wiktionary, every language is included in one wiki (and even on one page), so it would be phenomenal to be able to select the collation per category. Per-page or per-wiki selection will not help very much at all.
- On category pages, maintain three offsets and do three queries (or
maybe UNION them together, doesn't matter),
In my personal opinion, UNION makes zero sense because you'd have to pull the data apart again after querying it, as you're displaying it separately as well. Separate queries are much cleaner in this case.
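As a sketch of the separate-queries approach, assuming a simplified, hypothetical `links` table with a `kind` column (this is not the actual categorylinks schema), each section is fetched with its own offset and stays separate all the way to display:

```python
import sqlite3


def fetch_section(conn, category, kind, start_key, limit):
    """Fetch one section of a category page, starting at its own offset."""
    return conn.execute(
        "SELECT title, sortkey FROM links "
        "WHERE category = ? AND kind = ? AND sortkey >= ? "
        "ORDER BY sortkey LIMIT ?",
        (category, kind, start_key, limit),
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout, for illustration only.
    conn.execute(
        "CREATE TABLE links (category TEXT, kind TEXT, sortkey TEXT, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO links VALUES (?, ?, ?, ?)",
        [
            ("Fruit", "subcat", "CITRUS", "Citrus"),
            ("Fruit", "page", "APPLE", "Apple"),
            ("Fruit", "page", "BANANA", "Banana"),
            ("Fruit", "page", "CHERRY", "Cherry"),
            ("Fruit", "page", "DATE", "Date"),
            ("Fruit", "file", "APPLE.JPG", "Apple.jpg"),
        ],
    )
    # One offset per section; only the page list has been paged forward.
    offsets = {"subcat": "", "page": "B", "file": ""}
    return {
        kind: fetch_section(conn, "Fruit", kind, start, 2)
        for kind, start in offsets.items()
    }


sections = demo()
```

Each entry of `sections` can be rendered directly, with nothing to pull apart afterwards, which is the cleanliness argument against UNION.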
One problem that was pointed out somewhere in the massive useless discussion on bug 164 is that we'd have to do something to display the first letter for each section. Currently it's just the first letter of the sortkey, but if that's some binary string, that becomes a problem. I'm not seeing an obvious solution, since the sortkey-generation algorithm will be opaque to us. If it sorts Á the same as A, then how do we figure out that the "canonical" first letter for the section should be "A" and not "Á"? How do we even figure out where the sections begin or end? Would that even make sense in all cases? At a first pass, I'd say we should just skip the first letter and display all the items straight from beginning to end without section divisions. I don't think that's a big problem.
I agree that the first-letter thing is a nice-to-have, but I'm more worried about the general problem that sortkeys won't be human-readable strings anymore (the API currently displays them and, obviously, uses them for paging) nor possible to decode into human-readable strings (because the encoding essentially loses information when e.g. a and á are folded). It would be nice if we could store the original, unmunged sortkey in the categorylinks table, although I realize that would eat space for display and debugging purposes only.
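A small sketch of why the folded key cannot be decoded, and what keeping the original next to it would look like. The folding here is a crude stand-in (strip combining marks, then case-fold), not the real sortkey algorithm, and the `raw_sortkey` field name is made up for illustration.

```python
import unicodedata


def folded_key(text):
    """Crude stand-in for an opaque sortkey: drop accents and case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().encode("utf-8")


def make_row(sortkey_text):
    """Keep the unmunged sortkey beside the folded one (hypothetical fields)."""
    return {"sortkey": folded_key(sortkey_text), "raw_sortkey": sortkey_text}


# Two different inputs collapse to one key, so the key alone cannot
# tell us which string the user originally supplied.
same = folded_key("Árbol") == folded_key("arbol")
row = make_row("Árbol")
```

Paging and ordering would use `row["sortkey"]`, while display and debugging would read `row["raw_sortkey"]`, at the cost of the extra storage mentioned above.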
There is no way to go from the sort key to the first letter, and indeed you can't even put the first letter at the start of the sort key, as you need to sort the sections differently per language. The solution I use for generating the indices on Wiktionary is to store the first letter explicitly (either of the page or the user-provided sort key before they are fed into ICU). This would (in the future) allow "topical" categories, but that's just a distraction for now.
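Roughly, the idea looks like the sketch below. This is not the actual Wiktionary code: `build_index` is a hypothetical helper, the byte strings stand in for opaque ICU sort keys, and the second section order is invented purely to show that the order is a per-language input.

```python
def build_index(rows, section_order):
    """Group rows by their stored first letter and order the sections.

    rows: iterable of (first_letter, sort_key, title), where sort_key
    is treated as opaque bytes and first_letter is stored explicitly.
    section_order: string listing the section letters for one language.
    """
    rank = {letter: i for i, letter in enumerate(section_order)}
    sections = {}
    for letter, sort_key, title in rows:
        sections.setdefault(letter, []).append((sort_key, title))
    # Letters missing from section_order go to the end.
    ordered = sorted(sections, key=lambda letter: rank.get(letter, len(rank)))
    return [
        (letter, [title for _, title in sorted(sections[letter])])
        for letter in ordered
    ]


rows = [
    ("Ä", b"\x01", "Äpple"),
    ("A", b"\x02", "Apelsin"),
    ("A", b"\x01", "Apa"),
    ("Z", b"\x01", "Zebra"),
]
swedish = build_index(rows, "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ")
# Invented order that places Ä directly after A, for contrast only.
other = build_index(rows, "AÄBCDEFGHIJKLMNOPQRSTUVWXYZ")
```

The section heading comes from the stored letter and the section position from the per-language order, so neither has to be recovered from the sort key.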
Conrad