On 21 July 2010 15:28, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Wed, Jul 21, 2010 at 6:18 PM, Daniel Kinzler daniel@brightbyte.de wrote:
While we definitely need efficient retrieval by namespace, the default sort key should *not* include the namespace prefix. It's very annoying that all files currently get sorted under "F", or that pages from the Wikipedia namespace all end up under "W".
That's totally orthogonal and is like a one-line change. Probably you just have to change getPrefixedDBkey() to getDBkey() somewhere.
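To make the effect of that change concrete, here is a rough Python sketch. The helpers `prefixed_key` and `plain_key` are hypothetical stand-ins for what getPrefixedDBkey() and getDBkey() return; they are not the MediaWiki code, and the page list is made up for illustration.

```python
# Hypothetical stand-ins for getPrefixedDBkey() / getDBkey(); not MediaWiki code.
def prefixed_key(namespace, title):
    # Sort key that includes the namespace prefix (the behaviour complained about).
    return f"{namespace}:{title}" if namespace else title

def plain_key(namespace, title):
    # Sort key without the namespace prefix (the proposed default).
    return title

# Made-up category members: (namespace, title); "" is the main namespace.
pages = [("File", "Apple.jpg"), ("Wikipedia", "Banana"), ("", "Cherry")]

def first_letters(key_func):
    # The letter each page would be filed under in the category listing.
    return [key_func(ns, title)[0].upper() for ns, title in pages]

print(first_letters(prefixed_key))  # ['F', 'W', 'C']
print(first_letters(plain_key))     # ['A', 'B', 'C']
```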
On Wed, Jul 21, 2010 at 6:22 PM, Conrad Irwin conrad.irwin@gmail.com wrote:
Because there are multiple languages on each page - so you need lots of different sort keys.
Could you point me to an example of some pages and categories where this is an issue? I'm not clear on how categories/pages/sort keys are being used here.
I don't have an example to hand, as the page is not yet complete on Wiktionary. The Hungarian letter "cs" sorts after "c", so while in English "cs" (for centiseconds) should come before "CV", in Hungarian the entry for the letter "cs" (which is missing) should come after it. Both the English and Hungarian entries would be on the same Wiktionary page.
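A toy Python sketch of the "cs" versus "CV" case, in case it helps. It assumes a naive greedy split into Hungarian letters and ranks them by their position in the alphabet; real Hungarian collation is subtler (accented vowels differ only at a secondary level, and greedy digraph splitting gets some compound words wrong). The names `hu_key` and `en_key` are made up for this illustration.

```python
# The 44 letters of the Hungarian alphabet, in order (digraphs and the
# trigraph "dzs" count as single letters).
HU_LETTERS = [
    "a", "á", "b", "c", "cs", "d", "dz", "dzs", "e", "é", "f", "g", "gy",
    "h", "i", "í", "j", "k", "l", "ly", "m", "n", "ny", "o", "ó", "ö", "ő",
    "p", "q", "r", "s", "sz", "t", "ty", "u", "ú", "ü", "ű", "v", "w", "x",
    "y", "z", "zs",
]
RANK = {letter: index for index, letter in enumerate(HU_LETTERS)}

def hu_tokens(word):
    # Greedily split into Hungarian letters, trying the longest match first.
    word = word.lower()
    tokens, i = [], 0
    while i < len(word):
        for size in (3, 2, 1):
            chunk = word[i:i + size]
            if chunk in RANK:
                break
        else:
            chunk = word[i]  # unknown character: keep it as its own token
        tokens.append(chunk)
        i += len(chunk)
    return tokens

def hu_key(word):
    # Simplified Hungarian key: alphabet ranks; unknown characters sort last.
    return [RANK.get(token, len(HU_LETTERS)) for token in hu_tokens(word)]

def en_key(word):
    # Simplified English key: case-insensitive, character by character.
    return word.lower()

print(sorted(["CV", "cs"], key=en_key))  # ['cs', 'CV']
print(sorted(["cs", "CV"], key=hu_key))  # ['CV', 'cs']
```

The same two strings need two different keys depending on the language of the entry, which is why one sort key per page is not enough.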
If it's sorted as starting with "a" it should appear under "a". The alternative would be to have different explicit sorting for the sections in the category than for the words in the section, which I think is unnecessary.
So if we have three pages "Áa", "Ab", "Ác" and they're sorted in the category in that order, should they be in one section? I don't see how you'd put them in two or three sections. If they're in one section, what letter do you use for it, "Á" or "A"? We can figure out "A" is correct here, but how do you do that in general automatically?
Some languages treat accented letters as the same primary letter and distinguish them only in the secondary or tertiary sort key (which the current category table's 80-byte keys are in danger of truncating). Others have variations on a theme. Again Hungarian makes a good example: ö and ő are one letter with two stresses, while o and ó are a different letter. It should be possible to extract the first letter automatically from the words to be sorted (I don't know if ICU covers that; if not, just ask some people who speak the language, or Wikipedia), but it's not possible to get that information from the sort keys directly. So either we store both the user-provided sort key and our derived sort key, so we can use the former to find the first letter at render time, or we just store the first letter.
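As a rough Python sketch of the "just store the first letter" option: the record layout and the names `make_entry` and `sections` are hypothetical, not the category table schema. The derived key here simply strips accents, which only suits languages that treat accents as a secondary difference; Hungarian o versus ö would need a per-language rule instead.

```python
import unicodedata
from itertools import groupby

def strip_accents(text):
    # Decompose characters and drop the combining marks (accents).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def make_entry(user_key):
    # Hypothetical record: keep the user-provided key, a derived key, and
    # the first letter computed once, while the original text is available.
    primary = strip_accents(user_key).casefold()
    return {
        "user_key": user_key,
        "sort_key": (primary, user_key),  # stand-in for an opaque collation key
        "first_letter": primary[:1].upper(),
    }

def sections(user_keys):
    # Sort by the derived key, then group under the stored first letter.
    entries = sorted((make_entry(k) for k in user_keys),
                     key=lambda entry: entry["sort_key"])
    return [
        (letter, [entry["user_key"] for entry in group])
        for letter, group in groupby(entries, key=lambda entry: entry["first_letter"])
    ]

print(sections(["Ác", "Ab", "Áa", "Ba"]))
# [('A', ['Áa', 'Ab', 'Ác']), ('B', ['Ba'])]
```

With the first letter stored next to the key, "Áa", "Ab" and "Ác" land in one section headed "A" without anyone having to decode the sort key at render time.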
Conrad
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l