On Wed, Jul 21, 2010 at 6:45 PM, Conrad Irwin conrad.irwin@gmail.com wrote:
I don't have an example to hand (as the page is not yet complete on Wiktionary). The Hungarian letter "cs" sorts after "c", so while in English "cs" (for centi-seconds) should come before "CV", in Hungarian the entry for the letter (which is missing) should come afterwards. Both English and Hungarian would be on the same Wiktionary page.
Okay, I see. I don't think this would be terribly hard, although I don't think it's needed for an initial implementation. The major problem I see is that if the sort collation is per-category, then changing it on a preexisting large category will require reparsing all the pages, probably. (Unless we store the raw sortkeys as well.)
Some languages treat accented letters as the same primary letter and distinguish them only in the secondary or tertiary sort key (which the current category table's 80-byte keys are in danger of truncating). Others have variations on the theme. Again Hungarian makes a good example: ö and ő are one letter with two stresses, while o and ó are a different letter. It should be possible to automatically extract the first letter from the words to be sorted (I don't know if ICU covers that; if not, just ask some people who speak the language, or Wikipedia), but it's not possible to get that information from the sort keys directly. So either we store both the user-provided sort key and our derived sort key, so we can use the former to find the first letter at render time, or we just store the first letter.
I don't see an answer to my question here. Given a sorted list of sortkeys, possibly including the raw sortkey as well as the one that's been put through ICU/CLDR/whatever, what algorithm do you propose to break it up into sections labeled with first letters? In particular, any such algorithm should not conflict with the sort order, in the sense that you should not have three words A, B, C sorted as A < B < C where firstLetter(A) == firstLetter(C) != firstLetter(B). Is this reasonably possible to guarantee in all alphabetic languages' conventional sort orders?
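To make the constraint concrete, here is a minimal Python sketch of the check described above: given a list already in sort order and some first-letter function, it verifies that no section label reappears after a different label has intervened. The names sections_are_contiguous and naive_first_letter are hypothetical, and the first-letter rule (first code point, uppercased) is only a stand-in for whatever a real locale-aware rule would be.

```python
from itertools import groupby


def sections_are_contiguous(sorted_words, first_letter):
    """Return True if every section label covers one unbroken run.

    This rejects A < B < C where first_letter(A) == first_letter(C)
    but first_letter(B) differs.
    """
    seen = set()
    # groupby yields one group per run of consecutive equal labels.
    for label, _run in groupby(sorted_words, key=first_letter):
        if label in seen:
            return False  # label came back after another section
        seen.add(label)
    return True


def naive_first_letter(word):
    # Assumption: "first letter" is the first code point, uppercased.
    return word[0].upper()


words = ["avocado", "Bob", "Apple"]

# Raw code point order puts uppercase before lowercase:
# Apple, Bob, avocado -> sections A, B, A (conflict).
binary_order = sorted(words)

# Case-insensitive order: Apple, avocado, Bob -> sections A, B (consistent).
folded_order = sorted(words, key=str.lower)

print(sections_are_contiguous(binary_order, naive_first_letter))
print(sections_are_contiguous(folded_order, naive_first_letter))
```

The same check could in principle be run against any candidate pairing of collation and first-letter rule to see whether the two agree on a given data set. It does not prove the guarantee holds for a language in general.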
If we do store the raw sort key, we could have some Language method to retrieve the section name, and just write our own implementations for various languages. However, I'm not sure this is worth the effort.
On Wed, Jul 21, 2010 at 7:03 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
It doesn't make a great deal of sense and can be changed fairly easily in Title::isValidMoveTarget().
On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling tstarling@wikimedia.org wrote:
This restriction is enforced by Title::isValidMoveOperation().
Any objections to changing this so files can't be moved over non-files or vice versa?
An alternative would be to add a column to the categorylinks table, say cl_type. It could be an ENUM or some short text type. Then the index could be altered to include this field at the start of it.
Presumably the rationale for combining these two things into cl_sortkey is to avoid a schema change, and to make the paging code slightly simpler. But I worry that future generations of MediaWiki developers will curse us for laziness and obfuscation.
I'm okay with this.
Well, I've said ICU, possibly with a PHP simulation of some Western European sort key algorithm for the benefit of users without access to ICU. But I formed that opinion years ago, and I never properly surveyed all the possible solutions in the first place. It probably makes sense to do a little of your own research.
Gerard Meijssen felt strongly that we should use something based on CLDR. Apparently we have connections there and work with them a lot, and I guess he feels it's higher-quality or such.
Note that I specifically excluded the actual implementation of language-dependent sort keys from the requirements list when I wrote up this project. It could easily eat up a lot of time, and it's not necessary for a proof-of-principle implementation.
All right. Then I'll just do whatever's readily available and fits in the database column.
Work out how much space we would need to additionally store the category keys in plain text. Then we will know what sort of tradeoff we are looking at. Have you got a toolserver account you can use to do the sums?
Yes, I'm a toolserver root.
Since we won't be sorting on the plain text form anymore, we could use some tricks to save space. For instance, if the sort key is the same as the article title, we could store NULL instead of another copy of the article title. That should save 95% or so.
It doesn't seem like it would save nearly that much. On the Welsh Wikipedia (small enough database to be manageable), I get the following:
mysql> SELECT SUM(LENGTH(cl_sortkey)) FROM categorylinks JOIN page ON cl_from=page_id WHERE REPLACE(cl_sortkey, ' ', '_') != page_title;
+-------------------------+
| SUM(LENGTH(cl_sortkey)) |
+-------------------------+
|                  551851 |
+-------------------------+
1 row in set (1.94 sec)

mysql> SELECT SUM(LENGTH(cl_sortkey)) FROM categorylinks JOIN page ON cl_from=page_id;
+-------------------------+
| SUM(LENGTH(cl_sortkey)) |
+-------------------------+
|                 1619747 |
+-------------------------+
1 row in set (0.44 sec)

mysql> SELECT SUM(LENGTH(cl_sortkey)) FROM categorylinks JOIN page ON cl_from=page_id WHERE REPLACE(cl_sortkey, ' ', '_') != page_title AND page_namespace = 0;
+-------------------------+
| SUM(LENGTH(cl_sortkey)) |
+-------------------------+
|                  347539 |
+-------------------------+
1 row in set (0.20 sec)

mysql> SELECT SUM(LENGTH(cl_sortkey)) FROM categorylinks JOIN page ON cl_from=page_id WHERE page_namespace = 0;
+-------------------------+
| SUM(LENGTH(cl_sortkey)) |
+-------------------------+
|                 1067588 |
+-------------------------+
1 row in set (0.19 sec)
I restricted the last two queries to the main namespace to avoid false positives from namespace prefixes. This suggests savings of maybe 50-75%. The story may be different on larger wikis. It's worth remembering, though, that a lot of these sortkeys might be set to work around deficiencies in the current default sortkey generation, so maybe it would be higher savings in the long term.
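For reference, the arithmetic behind that estimate can be sketched in a few lines of Python using the four sums above. The assumption here is that the only cost counted is the bytes of sortkey text, and that every sortkey matching the page title would be stored as NULL at zero cost. The savings function is a hypothetical helper, not anything in MediaWiki.

```python
def savings(total_bytes, differing_bytes):
    """Fraction of sortkey bytes saved if keys equal to the title become NULL.

    total_bytes:     SUM(LENGTH(cl_sortkey)) over all rows considered
    differing_bytes: the same sum over rows whose sortkey differs from the title
    """
    return 1 - differing_bytes / total_bytes


# Welsh Wikipedia figures from the queries above.
all_namespaces = savings(total_bytes=1619747, differing_bytes=551851)
main_namespace = savings(total_bytes=1067588, differing_bytes=347539)

print(f"all namespaces: {all_namespaces:.1%}")
print(f"main namespace: {main_namespace:.1%}")
```

Both figures come out at roughly two thirds, which sits inside the 50-75% range and well short of 95%.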
It's still not at all clear to me that saving a raw copy in the database is worth it. If we really need sectioning by first letter on category pages, we could save the first letter instead, and leave that NULL when it's the same as the first letter of the page title (all of this for some locale-specific definition of "first letter"). But I don't know if we need that.
On Thu, Jul 22, 2010 at 3:37 AM, Roan Kattouw roan.kattouw@gmail.com wrote:
There is another reason to prefer this schema, which is that the originally proposed one is susceptible to weird transition bugs. After this feature is deployed, there will be old-format (i.e. plain) sortkeys sticking around in the database for quite some time after (they won't go away until LinksUpdate fixes them), which means that pages whose sortkey starts with a C or F will be recognized as categories and files respectively, even if they're normal pages.
The best way to mitigate that is to populate the namespace information prior to deployment. In Tim's schema, that means filling the cl_type field based on page_namespace. In the sortkey prefix schema, that means prefixing the sortkey with the relevant prefix, but that also requires that the sortkey-updating code has already been updated by then (so it doesn't overwrite new-style sortkeys with old-style ones), which means you'd have to partially deploy the code while running the population script. Yuck.
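As a rough illustration of the cl_type population step, the mapping from page_namespace could look like the Python sketch below. The namespace numbers are MediaWiki's standard ones (6 for files, 14 for categories). The three type values and the function name cl_type_for_namespace are assumptions for the example, since the thread has not settled on the actual values.

```python
# Standard MediaWiki namespace numbers.
NS_FILE = 6
NS_CATEGORY = 14


def cl_type_for_namespace(page_namespace):
    """Map a page's namespace to a value for the proposed cl_type column.

    The strings returned are illustrative only; the proposal says just
    that the column could be an ENUM or some short text type.
    """
    if page_namespace == NS_CATEGORY:
        return "subcat"
    if page_namespace == NS_FILE:
        return "file"
    return "page"


# A population script would apply this to each categorylinks row,
# looking up page_namespace via cl_from = page_id.
for namespace in (0, 6, 14):
    print(namespace, cl_type_for_namespace(namespace))
```

Because the value is derived from page_namespace alone, it can be filled in without looking at the sortkey at all, which is what sidesteps the C/F ambiguity with old-format sortkeys.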
This whole problem arises for sortkey changes generally. It will be just as much of a problem when going to a new sortkey type (based on CLDR or whatever). The only way to avoid it is to create a new column, populate it while maintaining both columns at once, start using the new column once it's fully populated, and then drop the old column. That seems excessive. Remember that we can convert the current raw sortkey into ICU/CLDR/whatever without reparsing pages, as long as we can reliably tell old from new sortkeys (should be pretty easy to do heuristically). So it shouldn't take forever -- surely no more than a day or two even for enwiki.
On Thu, Jul 22, 2010 at 5:34 AM, David Gerard dgerard@gmail.com wrote:
Please don't remove the feature where the first letter of the sort key is displayed in the rendered category page, and if necessary add what it takes to keep it.
There are scripts where this will be a hard problem, but it's still much-used and much-loved in those where it isn't.
Is it? What use does it serve? We don't have it for any other type of list. We have zillions of types of page lists, and category pages are the only ones that have the first letter displayed. It makes the columns uneven, and is completely crazy for some scripts (like CJK, AFAICT).