On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling tstarling@wikimedia.org wrote:
An alternative would be to add a column to the categorylinks table, say cl_type. It could be an ENUM or some short text type. Then the index could be altered to include this field at the start of it.
Presumably the rationale for combining these two things into cl_sortkey is to avoid a schema change, and to make the paging code slightly simpler. But I worry that future generations of MediaWiki developers will curse us for laziness and obfuscation.
One problem with this, though, that just occurred to me: won't it mean that any existing users of cl_sortkey will suddenly not be using an index? I see a whole bunch of extensions that do ORDER BY cl_sortkey. Hacking up cl_sortkey to store weird stuff in a prefix will make them sort things oddly, but they won't filesort large categories. I assume we don't want to keep an extra index just for these marginal uses.
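To make the index concern concrete, here is a rough sketch that asks a query planner whether each index layout needs a separate sort step. It uses SQLite from Python as a stand-in, so it only illustrates the general leftmost-prefix behaviour and is not a measurement of MySQL. The index name `cl_sortkey_idx`, the simplified table definition and the helper `needs_sort` are made up for the illustration.

```python
import sqlite3

# Query shape used by existing extensions (no knowledge of cl_type).
QUERY_OLD = "SELECT cl_from FROM categorylinks WHERE cl_to = ? ORDER BY cl_sortkey"
# Same query once it also pins cl_type to a single value.
QUERY_TYPED = ("SELECT cl_from FROM categorylinks "
               "WHERE cl_to = ? AND cl_type = ? ORDER BY cl_sortkey")


def needs_sort(index_cols, query, params):
    """Return True if the planner adds an explicit sort step for the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE categorylinks "
        "(cl_from INTEGER, cl_to TEXT, cl_type TEXT, cl_sortkey TEXT)"
    )
    conn.execute(f"CREATE INDEX cl_sortkey_idx ON categorylinks ({index_cols})")
    rows = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
    conn.close()
    # The last column of each plan row is the human-readable detail string.
    return any("TEMP B-TREE" in row[-1] for row in rows)


# Current layout: the index already delivers rows in cl_sortkey order.
old_index_old_query = needs_sort("cl_to, cl_sortkey", QUERY_OLD, ("Foo",))
# cl_type inserted before cl_sortkey: the old query has to sort.
new_index_old_query = needs_sort("cl_to, cl_type, cl_sortkey", QUERY_OLD, ("Foo",))
# Filtering on cl_type as well restores index order.
new_index_typed_query = needs_sort(
    "cl_to, cl_type, cl_sortkey", QUERY_TYPED, ("Foo", "page")
)
```

In this sketch only the middle case needs a sort, which matches the worry above: unmodified extensions would lose index ordering unless they also filter on cl_type.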
On Thu, Jul 22, 2010 at 11:02 PM, Tim Starling tstarling@wikimedia.org wrote:
No objection. That's mostly how it is already, except when the file doesn't exist but the description page does.
Okay, done in r69802.
Welsh is not really what I was thinking when I said get statistics.
Well, it's a lot faster, particularly on the toolserver. Wikimedia database servers don't have to cope with ludicrously stupid queries being run all the time. :)
So we're looking at 17 bytes per row for raw text, and 8 bytes per row for compacted text, plus 1 byte per row for the length byte. Overall, assuming the lengths are the same across all namespaces, it would be approximately 680 MB in the raw form for the English Wikipedia, and presumably several times that for all wikis. Our English Wikipedia core DB servers have between 700 GB and 2 TB of storage space, with ~450 GB currently in use. So the impact of adding an extra 1 GB or so would be minimal.
So within the range of my previous estimate in relative terms, 50% to 75% space saved.
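For reference, the arithmetic behind those figures can be redone in a few lines. This is only a back-of-the-envelope check under two assumptions that are mine, not stated in the thread: that the 680 MB figure counts the 17 bytes of raw text per row without the length byte, and that 1 MB means 10^6 bytes. The implied row count falls out of those assumptions and is not a measured number.

```python
RAW_TEXT_BYTES = 17      # raw text per row, from the thread
COMPACT_TEXT_BYTES = 8   # compacted text per row, from the thread
LENGTH_BYTES = 1         # length byte per row, from the thread
RAW_TOTAL_BYTES = 680 * 1_000_000  # "approximately 680 MB", assumed text only

# Row count implied by the assumptions above (not a measured figure).
implied_rows = RAW_TOTAL_BYTES // RAW_TEXT_BYTES

# Per-row cost including the length byte.
raw_row_bytes = RAW_TEXT_BYTES + LENGTH_BYTES
compact_row_bytes = COMPACT_TEXT_BYTES + LENGTH_BYTES

raw_total_mb = implied_rows * raw_row_bytes / 1_000_000
compact_total_mb = implied_rows * compact_row_bytes / 1_000_000

# Fraction of space saved by storing the compacted form instead of the raw one.
saving = 1 - compact_row_bytes / raw_row_bytes

print(f"implied rows: {implied_rows:,}")
print(f"raw: {raw_total_mb:.0f} MB, compact: {compact_total_mb:.0f} MB, "
      f"saved: {saving:.0%}")
```

Under these assumptions the saving comes out at the low end of the 50% to 75% range.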
Truncating after the first letter would only save about 260MB for the entire English Wikipedia. And it would limit the applications. For instance, it would prevent fast updates of the collation algorithm. Instead we would have to reparse the pages. That could take weeks, even with a dozen servers dedicated to the task.
Okay, I'll store the raw versions.
If we're going to have multiple locale-specific collation algorithms (and that seems likely), then it may make sense to add a collation ID foreign key to the categorylinks table, to track updates. Sensible sorting behaviour mid-way through an update is probably not feasible, but we can at least make it possible to track the problem.
One way to have sensible sorting behavior midway through collation (suggested by Philippe Verdy on bug 164, if I understood right, but at least that's the inspiration) would be to have a cl_collation to track this, and then extend the unique index on (cl_from, cl_to) to (cl_from, cl_to, cl_collation), adjust all other indexes similarly where necessary, and add WHERE cl_collation = 73 or whatever to all the queries. Then when switching collations, we could have the code start keeping them updated in parallel; do a batch job to add extra rows for the new collation where they don't exist; and when that's done, stop maintaining the old collation and DELETE it. Do you think that's a good idea? How often do we expect to have to change the collation?
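Roughly, the parallel-collation switch described above could look like the following sketch. It uses SQLite from Python purely so that it runs standalone. The collation IDs, the two stand-in collation functions, the always-populated cl_raw_sortkey column and the helper names are all assumptions for illustration and not actual MediaWiki code.

```python
import sqlite3

OLD, NEW = 1, 2  # hypothetical collation IDs


def sortkey(raw, collation):
    # Stand-in collations: OLD keeps binary order, NEW ignores case.
    return raw if collation == OLD else raw.upper()


def add_link(conn, cl_from, cl_to, raw, collations):
    # One row per collation; the widened unique index makes reruns harmless.
    for c in collations:
        conn.execute(
            "INSERT OR IGNORE INTO categorylinks VALUES (?, ?, ?, ?, ?)",
            (cl_from, cl_to, c, raw, sortkey(raw, c)),
        )


def backfill(conn, old, new):
    # Batch job: add rows for the new collation where they don't exist yet.
    rows = conn.execute(
        "SELECT cl_from, cl_to, cl_raw_sortkey FROM categorylinks "
        "WHERE cl_collation = ?", (old,)
    ).fetchall()
    for cl_from, cl_to, raw in rows:
        add_link(conn, cl_from, cl_to, raw, [new])


def members(conn, cl_to, collation):
    # Every reader pins one collation, so it never sees a half-migrated mix.
    rows = conn.execute(
        "SELECT cl_raw_sortkey FROM categorylinks "
        "WHERE cl_to = ? AND cl_collation = ? ORDER BY cl_sortkey",
        (cl_to, collation),
    )
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categorylinks ("
    " cl_from INTEGER NOT NULL, cl_to TEXT NOT NULL,"
    " cl_collation INTEGER NOT NULL,"
    " cl_raw_sortkey TEXT NOT NULL, cl_sortkey TEXT NOT NULL,"
    " UNIQUE (cl_from, cl_to, cl_collation))"
)
for page_id, title in enumerate(["apple", "Banana", "cherry"], start=1):
    add_link(conn, page_id, "Fruit", title, [OLD])

before = members(conn, "Fruit", OLD)
add_link(conn, 4, "Fruit", "Date", [OLD, NEW])  # writes now go to both
backfill(conn, OLD, NEW)
during_old = members(conn, "Fruit", OLD)        # old readers still consistent
after = members(conn, "Fruit", NEW)             # readers switched to NEW
conn.execute("DELETE FROM categorylinks WHERE cl_collation = ?", (OLD,))
remaining_old = conn.execute(
    "SELECT COUNT(*) FROM categorylinks WHERE cl_collation = ?", (OLD,)
).fetchone()[0]
```

The cost is that the table temporarily holds two rows per link, which is the trade-off for never showing a category sorted partly by one collation and partly by another.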
We have zillions of lists, but category pages are by far the most visible and heavily used. That's why so much work has been done on making them look nice, and why so many people are complaining about category sorting instead of [[Special:DeadendPages]] sorting.
Okay, so I'll make sure that the infrastructure is present for this to work, including a way for languages that don't want it to turn it off.
The uneven column issue is fixable using the multi-column layout feature in CSS 3 and more recent versions of the major browsers.
We'd have to scratch the "C cont." labels at the top of each column, though. Is that too much of an imposition on the page's prettiness? :)
On Fri, Jul 23, 2010 at 12:40 AM, Krinkle krinklemail@gmail.com wrote:
Something that comes to mind is Wikimedia Commons. Would the same category with a different uselang be sorted differently?
No. That's not practical at all. It can only be per-category. (I don't plan to implement per-category sort orders in this implementation, unless I'm asked to.)
No objection either. However, make sure that the other way around is blocked too. Else one might accidentally move a page to the File-namespace without being able to get it back.
Of course.
Okay, so I think I have enough feedback to write up an initial implementation. I'll start on that now. To begin with I'll probably just stick in a dummy collation function, such as "convert everything to uppercase", and test it with $wgCapitalLinks false. Outline of my initial implementation plans:
* Add $wgExperimentalCategorySort, default to false, for testing, so that I can commit this to trunk (I don't like using branches).
* Add unindexed varchar(255) NULL cl_raw_sortkey column, to serve the same function as the current cl_sortkey, except it will be NULL if the raw sortkey is identical to the one retrieved from the name.
* Add cl_collation tinyint column to track collation revisions, with an index. Add a $wgCollationVersion variable, initially set to 0 (for no collation).
* Add cl_type column, an ENUM of ('page', 'subcat', 'file'). ENUMs are somewhat evil to change in MySQL, but this isn't likely to change.
* Change existing indexes appropriately.
* Write code to keep the new columns populated and use them on category page display.
* Write a script to populate the new columns.
* Write some trivial collation function and try to migrate to it.
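As a rough illustration of how the new columns might be filled in for one link, here is a small Python sketch using the dummy uppercase collation. The function names, the dict layout and the `collation_version` parameter are hypothetical. It also assumes that cl_sortkey would hold the collated key while cl_raw_sortkey holds the user-supplied one, which is my reading of the outline rather than something it spells out. The namespace numbers 6 (File) and 14 (Category) are the standard MediaWiki ones.

```python
NS_FILE = 6        # MediaWiki File namespace
NS_CATEGORY = 14   # MediaWiki Category namespace


def collate(raw_sortkey):
    # Dummy collation: "convert everything to uppercase".
    return raw_sortkey.upper()


def link_row(title, namespace, explicit_sortkey=None, collation_version=1):
    """Build the new categorylinks column values for one member page.

    Hypothetical helper: title is the page name, explicit_sortkey is what
    the user wrote after the pipe in [[Category:Foo|sortkey]], if anything.
    """
    if namespace == NS_CATEGORY:
        cl_type = "subcat"
    elif namespace == NS_FILE:
        cl_type = "file"
    else:
        cl_type = "page"

    raw = title if explicit_sortkey is None else explicit_sortkey
    return {
        "cl_type": cl_type,
        # NULL (None) when the raw sortkey is just the page name.
        "cl_raw_sortkey": None if raw == title else raw,
        # Assumed: the indexed sortkey stores the collated form.
        "cl_sortkey": collate(raw),
        "cl_collation": collation_version,
    }


rows = [
    link_row("apple", 0),
    link_row("Albert Einstein", 0, explicit_sortkey="Einstein, Albert"),
    link_row("Fruit", NS_CATEGORY),
    link_row("Banana.jpg", NS_FILE),
]
# Members of one type, in collated order.
pages_sorted = sorted(
    (r for r in rows if r["cl_type"] == "page"), key=lambda r: r["cl_sortkey"]
)
```

With $wgCapitalLinks false, "apple" and "Apple" are distinct titles, which is what makes the uppercase collation observable in testing.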
I'm aiming to be done with this on Monday. There are still some outstanding issues I'm not sure about that I've outlined here, but they won't be much work to change later.