On 23/07/10 02:34, Aryeh Gregor wrote:
On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling tstarling@wikimedia.org wrote:
This restriction is enforced by Title::isValidMoveOperation().
Any objections to changing this so files can't be moved over non-files or vice versa?
No objection. That's mostly how it is already, except when the file doesn't exist but the description page does.
Since we won't be sorting on the plain text form anymore, we could use some tricks to save space. For instance, if the sort key is the same as the article title, we could store NULL instead of another copy of the article title. That should save 95% or so.
It doesn't seem like it would save nearly that much. On the Welsh Wikipedia (small enough database to be manageable), I get the following:
Welsh is not really what I was thinking when I said get statistics. On the English Wikipedia (db38):
mysql> show table status like 'categorylinks'\G
*************************** 1. row ***************************
           Name: categorylinks
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 38875439
 Avg_row_length: 161
    Data_length: 6271123456
Max_data_length: 0
   Index_length: 7946960896
      Data_free: 7340032
 Auto_increment: NULL
    Create_time: 2010-05-24 11:29:52
    Update_time: NULL
     Check_time: NULL
      Collation: binary
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.15 sec)

SELECT
    count(*),
    sum(length(cl_sortkey)) as raw_length,
    sum(
        if(REPLACE(cl_sortkey, ' ', '_') = page_title,
            0,
            length(cl_sortkey)
        )
    ) as compact_length
FROM categorylinks,page
WHERE cl_from=page_id
    and page_namespace=0
    and page_id % 10 = 0

*************************** 1. row ***************************
      count(*): 1957629
    raw_length: 34177525
compact_length: 14857665
1 row in set (19 min 26.05 sec)
So we're looking at 17 bytes per row for raw text, and 8 bytes per row for compacted text, plus 1 byte per row for the length byte. Overall, assuming the lengths are the same across all namespaces, it would be approximately 680 MB in the raw form for the English Wikipedia, and presumably several times that for all wikis. Our English Wikipedia core DB servers have between 700 GB and 2 TB of storage space, with ~450 GB currently in use. So the impact of adding an extra 1 GB or so would be minimal.
No doubt Domas will complain anyway, but without developers adding new features, I figure his volunteer DBA work would get very boring.
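For anyone who wants to check the arithmetic, here is a rough sketch that reproduces the estimate from the figures quoted above. It assumes, as stated, that the sampled namespace-0 lengths hold across all namespaces; the variable and function names are illustrative only.

```python
# Figures copied from the SHOW TABLE STATUS and sampled SELECT output above.
TOTAL_ROWS = 38875439       # Rows reported for categorylinks
SAMPLE_ROWS = 1957629       # count(*) from the 10% namespace-0 sample
RAW_LENGTH = 34177525       # sum of sort key lengths in the sample
COMPACT_LENGTH = 14857665   # same, counting 0 where sort key == title

# Average bytes of sort key text per row, before and after compaction.
raw_per_row = RAW_LENGTH / SAMPLE_ROWS
compact_per_row = COMPACT_LENGTH / SAMPLE_ROWS


def estimate_mb(bytes_per_row):
    """Scale a per-row average up to the whole table, in MB (10^6 bytes)."""
    return TOTAL_ROWS * bytes_per_row / 1e6


raw_mb = estimate_mb(raw_per_row)
compact_mb = estimate_mb(compact_per_row)

# Fraction of sort key text that storing NULL for "same as title" would save.
saving = 1 - COMPACT_LENGTH / RAW_LENGTH

print(f"raw: {raw_per_row:.1f} bytes/row, {raw_mb:.0f} MB")
print(f"compact: {compact_per_row:.1f} bytes/row, {compact_mb:.0f} MB")
print(f"saving from NULL compaction: {saving:.0%}")
```

The per-row length byte is left out of both totals, since it is the same in either form.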
It's still not at all clear to me that saving a raw copy in the database is worth it. If we really need sectioning by first letter on category pages, we could save the first letter instead, and leave that NULL when it's the same as the first letter of the page title (all of this for some locale-specific definition of "first letter"). But I don't know if we need that.
Truncating after the first letter would only save about 260MB for the entire English Wikipedia. And it would limit the applications. For instance, it would prevent fast updates of the collation algorithm. Instead we would have to reparse the pages. That could take weeks, even with a dozen servers dedicated to the task.
This whole problem arises for sortkey changes generally. It will be just as much of a problem when going to a new sortkey type (based on CLDR or whatever). The only way to avoid it is to create a new column, populate it while maintaining both columns at once, start using the new column once it's fully populated, and then drop the old column. That seems excessive.
If we're going to have multiple locale-specific collation algorithms (and that seems likely), then it may make sense to add a collation ID foreign key to the categorylinks table, to track updates. Sensible sorting behaviour mid-way through an update is probably not feasible, but we can at least make it possible to track the problem.
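As a rough illustration of the tracking idea, here is a sketch using an in-memory SQLite table. Everything in it is an assumption made for the example: the column names cl_raw_sortkey and cl_collation, the collation IDs, and the make_sortkey() stand-in are hypothetical, and the table the collation ID would reference is omitted. It is not a proposed schema.

```python
import sqlite3

OLD_COLLATION, NEW_COLLATION = 1, 2  # hypothetical collation IDs


def make_sortkey(raw, collation_id):
    """Stand-in for a real collation: ID 1 is binary, ID 2 ignores case."""
    text = raw if collation_id == OLD_COLLATION else raw.casefold()
    return text.encode("utf-8")


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE categorylinks (
        cl_from        INTEGER NOT NULL,
        cl_raw_sortkey TEXT    NOT NULL,  -- hypothetical: raw text copy
        cl_sortkey     BLOB    NOT NULL,  -- binary key used for ordering
        cl_collation   INTEGER NOT NULL   -- hypothetical: collation ID
    )
""")
for page_id, raw in enumerate(["Zebra", "apple", "Mango", "banana"], start=1):
    conn.execute(
        "INSERT INTO categorylinks VALUES (?, ?, ?, ?)",
        (page_id, raw, make_sortkey(raw, OLD_COLLATION), OLD_COLLATION),
    )


def stale_rows(conn):
    """Count rows whose sort key was built with an outdated collation."""
    return conn.execute(
        "SELECT count(*) FROM categorylinks WHERE cl_collation != ?",
        (NEW_COLLATION,),
    ).fetchone()[0]


def update_batch(conn, batch_size):
    """Rebuild up to batch_size stale keys from the raw text, no reparse."""
    rows = conn.execute(
        "SELECT rowid, cl_raw_sortkey FROM categorylinks"
        " WHERE cl_collation != ? LIMIT ?",
        (NEW_COLLATION, batch_size),
    ).fetchall()
    for rowid, raw in rows:
        conn.execute(
            "UPDATE categorylinks SET cl_sortkey = ?, cl_collation = ?"
            " WHERE rowid = ?",
            (make_sortkey(raw, NEW_COLLATION), NEW_COLLATION, rowid),
        )
    return len(rows)


# Track how many rows remain stale after each batch.
progress = [stale_rows(conn)]
while update_batch(conn, 3):
    progress.append(stale_rows(conn))

ordered = [r[0] for r in conn.execute(
    "SELECT cl_raw_sortkey FROM categorylinks ORDER BY cl_sortkey")]
```

While an update is in progress the table holds keys from both collations, so the ordering is inconsistent, but the stale-row count at least shows how far along the update is.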
On Thu, Jul 22, 2010 at 5:34 AM, David Gerard dgerard@gmail.com wrote:
Please don't remove the feature where the first letter of the sort key is displayed in the rendered category page, and if necessary add what it takes to keep it.
There are scripts where this will be a hard problem, but it's still much-used and much-loved in those where it isn't.
Is it? What use does it serve? We don't have it for any other type of list. We have zillions of types of page lists, and category pages are the only ones that have the first letter displayed. It makes the columns uneven, and is completely crazy for some scripts (like CJK, AFAICT).
We have zillions of lists, but category pages are by far the most visible and heavily used. That's why so much work has been done on making them look nice, and why so many people are complaining about category sorting instead of [[Special:DeadendPages]] sorting.
The CJK issue could be fixed by making the feature optional. The uneven column issue is fixable using the multi-column layout feature in CSS 3 and more recent versions of the major browsers.
-- Tim Starling