On Wed, Jul 21, 2010 at 5:49 PM, Roan Kattouw <roan.kattouw(a)gmail.com> wrote:
Blech. Does this make any sense? Can we change it? It would simplify
this considerably.
Note that different languages will want different orders. For
instance, German generally sorts ä as ae, ö as oe and ü as ue, whereas
the Swedish sort å, ä and ö at the end of the alphabet (so they
actually say A, B, C, ... Z, Å, Ä, Ö and use the phrase "from A to
Ö"). These collation schemes obviously conflict in their handling of ä
and ö, and I'm sure there's crazier stuff out there.
This could be solved by having a different collation scheme for each
content language (these have to be standardized *somewhere*, right?)
and using {{DEFAULTSORT:}} for those rare cases where you have an
article about a German person on a non-German wiki and want it to sort
the German way.
Yes, of course. I'm assuming that the magical sortkey-generator I'm
plugging into here is locale-specific.
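
A minimal sketch of what "locale-specific" means for the German and
Swedish examples above. The rules here are hand-rolled for illustration
only, and every name (german_sortkey, swedish_sortkey,
SORTKEY_BY_LANGUAGE, sort_titles) is hypothetical; a real sortkey
generator would presumably delegate to a collation library rather than
hard-code tables like this.

```python
# Hypothetical, hand-rolled collation rules for illustration only.
GERMAN_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def german_sortkey(title):
    """Expand umlauts (ä -> ae, ...) so they sort next to the base letter."""
    return "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in title.lower())


def swedish_sortkey(title):
    """Rank letters by the Swedish alphabet, where å, ä, ö follow z."""
    # Characters outside the table sort after all known letters.
    return tuple(
        SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET) + ord(ch))
        for ch in title.lower()
    )


# One sortkey function per content language (assumed language codes).
SORTKEY_BY_LANGUAGE = {"de": german_sortkey, "sv": swedish_sortkey}


def sort_titles(titles, language):
    return sorted(titles, key=SORTKEY_BY_LANGUAGE[language])


titles = ["Zebra", "Äpfel", "Apfel", "Ofen", "Öl"]
print(sort_titles(titles, "de"))
print(sort_titles(titles, "sv"))
```

The same five titles come out in a different order for each language,
which is the conflict over ä and ö described above.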
In my personal opinion, UNION makes zero sense because you'd have to
pull the data apart again after querying it, as you're displaying it
separately as well. Separate queries are much cleaner in this case.
It's pretty simple to do either way. Makes no big difference.
I agree that the first-letter thing is a nice-to-have, but I'm more
worried about the general problem that sortkeys won't be
human-readable strings anymore (the API currently displays them and,
obviously, uses them for paging) nor possible to decode into
human-readable strings (because the encoding essentially loses
information when e.g. a and á are folded). It would be nice if we
could store the original, unmunged sortkey in the categorylinks table,
although I realize that would eat space for display and debugging
purposes only.
This would also require altering the table. Why is it necessary? For
paging, we can just use cl_from to stick in the URL, and retrieve
cl_sortkey based on that and cl_to. That will keep the URL short and
not horribly ugly. When do we ever need a human-readable form of
the sortkey, as opposed to a human-readable form of the title? API
users should keep working when this happens with no special code
changes on server or client; they'll just have horribly long and ugly
URLs with encoded binary. Sortkeys are often weird and not suitable
for display to humans anyway, like when "*" is used.
I'm not seeing this as worth adding a fourth field to categorylinks,
which is a huge table already.
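
A rough sketch of the paging idea, assuming a toy three-column
categorylinks table in an in-memory SQLite database. The function name
category_page, the sample rows, and the exact query shape are
assumptions for illustration, not the actual MediaWiki code. The
continuation token is just a cl_from; the binary sortkey is looked up
from (cl_from, cl_to) and never appears in the URL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_sortkey BLOB)"
)
# Made-up rows; the sortkeys are opaque binary, as a collation library would emit.
sample_rows = [
    (10, "Living_people", b"\x2a\x01"),
    (11, "Living_people", b"\x2a\x01"),  # same sortkey as page 10
    (12, "Living_people", b"\x30\x05"),
    (13, "Living_people", b"\x41\x02"),
    (14, "Living_people", b"\x41\x09"),
]
conn.executemany("INSERT INTO categorylinks VALUES (?, ?, ?)", sample_rows)


def category_page(category, continue_from=None, limit=2):
    """Return (page ids, next token); the token is the cl_from of the next row."""
    if continue_from is None:
        cur = conn.execute(
            "SELECT cl_from FROM categorylinks WHERE cl_to = ? "
            "ORDER BY cl_sortkey, cl_from LIMIT ?",
            (category, limit + 1),
        )
    else:
        # Recover the opaque sortkey from (cl_from, cl_to) instead of the URL.
        (sortkey,) = conn.execute(
            "SELECT cl_sortkey FROM categorylinks "
            "WHERE cl_from = ? AND cl_to = ?",
            (continue_from, category),
        ).fetchone()
        # cl_from breaks ties between rows that share a sortkey.
        cur = conn.execute(
            "SELECT cl_from FROM categorylinks WHERE cl_to = ? "
            "AND (cl_sortkey > ? OR (cl_sortkey = ? AND cl_from >= ?)) "
            "ORDER BY cl_sortkey, cl_from LIMIT ?",
            (category, sortkey, sortkey, continue_from, limit + 1),
        )
    ids = [row[0] for row in cur]
    next_token = ids[limit] if len(ids) > limit else None
    return ids[:limit], next_token


page, token = category_page("Living_people")
while token is not None:
    print(page, "continue from", token)
    page, token = category_page("Living_people", continue_from=token)
print(page)
```

The sketch does not handle the case where the row behind the token has
been deleted between requests; a real implementation would need a
fallback for that.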
On Wed, Jul 21, 2010 at 6:04 PM, Conrad Irwin <conrad.irwin(a)gmail.com> wrote:
For Wiktionary, every language is included in one wiki (and even on
one page) - it would be phenomenal to be able to select the collation
per category, as per-page or per-wiki will not help very much at all.
Why won't per-page help? I'm not understanding clearly here. I don't
think it would be too much trouble to add per-page and per-category
parser functions to set the language used for sort keys, though.
There is no way to go from the sort-key to the first letter and
indeed, you can't even put the first letter at the start of the sort
key, as you need to sort the sections differently per language. The
solution I use for generating the indices on Wiktionary is to store
the first letter explicitly (either of the page or the user-provided
sort key before they are fed into ICU). This would (in the future)
allow "topical" categories, but that's just a distraction for now.
But different articles that are sorted as though they started with the
same letter might not actually start with the same letter, so how do
we figure out which first letter is the correct one? This is a
problem even if you're just dealing with accented letters -- I have no
idea how this stuff works (or doesn't work) for CJK or whatnot.
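
To make the concern concrete, here is a small sketch under stated
assumptions: folded_sortkey is a crude stand-in for the real sortkey
generator (it only strips accents and lowercases), and make_row and
the first_letter field are hypothetical names for the "store the first
letter explicitly" approach described above.

```python
import unicodedata
from itertools import groupby


def folded_sortkey(text):
    """Stand-in for the real sortkey generator: strip accents, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def make_row(title, user_sortkey=None):
    source = user_sortkey or title
    return {
        "title": title,
        "sortkey": folded_sortkey(source),
        # Stored explicitly, because it cannot be recovered from the folded key.
        "first_letter": source[0].upper(),
    }


rows = sorted(
    (make_row(t) for t in ["Ärger", "Apfel", "Ástor", "Banane"]),
    key=lambda r: r["sortkey"],
)

# Group consecutive rows under a heading taken from the stored first letter.
headings = [
    (letter, [r["title"] for r in group])
    for letter, group in groupby(rows, key=lambda r: r["first_letter"])
]
for letter, members in headings:
    print(letter, members)
```

The three titles that sort together under the folded key each end up
under their own heading (A, Ä, Á), so storing the raw first letter
alone does not decide which heading is the correct one.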
(Judging by these:
http://ja.wikipedia.org/wiki/Category:%E5%AD%98%E5%91%BD%E4%BA%BA%E7%89%A9
http://zh.wikipedia.org/wiki/Category:%E5%9C%A8%E4%B8%96%E4%BA%BA%E7%89%A9
http://zh-yue.wikipedia.org/wiki/Category:%E5%9C%A8%E4%B8%96%E4%BA%BA%E7%89…
the strategy is just to manually force sortkeys to begin with
something like "A" or "あ". Cantonese doesn't do this, and it ends up
with one article per "letter" in many cases.)