On 23/07/10 02:34, Aryeh Gregor wrote:
On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling
<tstarling(a)wikimedia.org> wrote:
This restriction is enforced by
Title::isValidMoveOperation().
Any objections to changing this so files can't be moved over non-files
or vice versa?
No objection. That's mostly how it is already, except when the file
doesn't exist but the description page does.
Since we won't be sorting on the plain text form anymore, we could use
some tricks to save space. For instance, if the sort key is the same
as the article title, we could store NULL instead of another copy of
the article title. That should save 95% or so.
It doesn't seem like it would save nearly that much. On the Welsh
Wikipedia (small enough database to be manageable), I get the
following:
Welsh is not really what I was thinking when I said get statistics. On
the English Wikipedia (db38):
mysql> show table status like 'categorylinks'\G
*************************** 1. row ***************************
           Name: categorylinks
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 38875439
 Avg_row_length: 161
    Data_length: 6271123456
Max_data_length: 0
   Index_length: 7946960896
      Data_free: 7340032
 Auto_increment: NULL
    Create_time: 2010-05-24 11:29:52
    Update_time: NULL
     Check_time: NULL
      Collation: binary
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.15 sec)
SELECT
    count(*),
    sum(length(cl_sortkey)) as raw_length,
    sum( if(REPLACE(cl_sortkey, ' ', '_') = page_title,
            0, length(cl_sortkey) ) ) as compact_length
FROM categorylinks, page
WHERE
    cl_from = page_id and
    page_namespace = 0 and
    page_id % 10 = 0
*************************** 1. row ***************************
      count(*): 1957629
    raw_length: 34177525
compact_length: 14857665
1 row in set (19 min 26.05 sec)
So we're looking at 17 bytes per row for raw text, and 8 bytes per row
for compacted text, plus 1 byte per row for the length byte. Overall,
assuming the lengths are the same across all namespaces, it would be
approximately 680 MB in the raw form for the English Wikipedia, and
presumably several times that for all wikis. Our English Wikipedia
core DB servers have between 700 GB and 2 TB of storage space, with
~450 GB currently in use. So the impact of adding an extra 1 GB or so
would be minimal.
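For anyone who wants to check the arithmetic, here is a rough sketch in
Python. The constants are copied from the output above; the function
names are made up for illustration. It assumes the 10% sample of
namespace 0 is representative of the whole table, and it treats the
InnoDB "Rows" figure as exact, which it is not.

```python
# Figures copied from the SHOW TABLE STATUS output and the sample query above.
TOTAL_ROWS = 38_875_439       # "Rows" for categorylinks (an InnoDB estimate)
SAMPLE_ROWS = 1_957_629       # count(*)
RAW_LENGTH = 34_177_525       # sum of length(cl_sortkey) over the sample
COMPACT_LENGTH = 14_857_665   # same, counting 0 when the key matches the title


def bytes_per_row(total_bytes: int, rows: int) -> float:
    """Average sort key length in the sample, excluding the length byte."""
    return total_bytes / rows


def projected_mb(per_row: float, rows: int = TOTAL_ROWS) -> float:
    """Scale a per-row average up to the whole table, in units of 10^6 bytes."""
    return per_row * rows / 1e6


raw_per_row = bytes_per_row(RAW_LENGTH, SAMPLE_ROWS)
compact_per_row = bytes_per_row(COMPACT_LENGTH, SAMPLE_ROWS)
raw_total_mb = projected_mb(raw_per_row)
compact_total_mb = projected_mb(compact_per_row)

# Fraction of sort key text that the NULL trick would avoid storing.
saving = 1 - COMPACT_LENGTH / RAW_LENGTH

if __name__ == "__main__":
    print(f"raw: {raw_per_row:.1f} bytes/row, {raw_total_mb:.0f} MB total")
    print(f"compact: {compact_per_row:.1f} bytes/row, {compact_total_mb:.0f} MB total")
    print(f"saving from NULL trick: {saving:.0%}")
```

On these figures the NULL trick saves a bit over half of the sort key
text, not 95%.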
No doubt Domas will complain anyway, but without developers adding new
features, I figure his volunteer DBA work would get very boring.
It's still not at all clear to me that saving a raw copy in the
database is worth it. If we really need sectioning by first letter on
category pages, we could save the first letter instead, and leave that
NULL when it's the same as the first letter of the page title (all of
this for some locale-specific definition of "first letter"). But I
don't know if we need that.
Truncating after the first letter would only save about 260 MB for the
entire English Wikipedia. And it would limit the applications. For
instance, it would prevent fast updates of the collation algorithm.
Instead we would have to reparse the pages. That could take weeks,
even with a dozen servers dedicated to the task.
This whole problem arises for sortkey changes generally. It will be
just as much of a problem when going to a new sortkey type (based on
CLDR or whatever). The only way to avoid it is to create a new
column, populate it while maintaining both columns at once, start
using the new column once it's fully populated, and then drop the old
column. That seems excessive.
If we're going to have multiple locale-specific collation algorithms
(and that seems likely), then it may make sense to add a collation ID
foreign key to the categorylinks table, to track updates. Sensible
sorting behaviour mid-way through an update is probably not feasible,
but we can at least make it possible to track the problem.
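To make the idea concrete, here is a minimal sketch in Python, not a
proposed implementation. Everything in it is hypothetical: the
COLLATIONS registry, the CategoryLink fields and the two toy collation
functions are stand-ins. It assumes each row keeps the raw sort key
next to the binary one, so a collation change can be applied in batches
without reparsing any pages, and the collation ID shows which rows
still need updating.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical registry: collation ID -> function from raw text to binary key.
COLLATIONS: Dict[int, Callable[[str], bytes]] = {
    1: lambda s: s.encode("utf-8"),             # plain binary order
    2: lambda s: s.casefold().encode("utf-8"),  # stand-in for a locale-aware algorithm
}


@dataclass
class CategoryLink:
    cl_from: int
    raw_sortkey: str    # human-readable key, kept so it can be re-collated
    sort_key: bytes     # binary key actually used for ordering
    collation_id: int   # which algorithm produced sort_key


def stale_rows(rows: List[CategoryLink], current_id: int) -> List[CategoryLink]:
    """Rows whose binary key was built by a different collation."""
    return [r for r in rows if r.collation_id != current_id]


def upgrade(rows: List[CategoryLink], current_id: int, batch_size: int = 1000) -> int:
    """Re-collate up to batch_size stale rows in place; return how many changed."""
    collate = COLLATIONS[current_id]
    batch = stale_rows(rows, current_id)[:batch_size]
    for row in batch:
        row.sort_key = collate(row.raw_sortkey)
        row.collation_id = current_id
    return len(batch)
```

Calling upgrade() repeatedly until it returns 0 finishes the migration.
While it is running, rows with different collation IDs are not
comparable, which is the mid-update sorting problem mentioned above.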
On Thu, Jul 22, 2010 at 5:34 AM, David Gerard
<dgerard(a)gmail.com> wrote:
Please don't remove the feature where the first letter of the sort key
is displayed in the rendered category page, and if necessary add what
it takes to keep it.
There are scripts where this will be a hard problem, but it's still
much-used and much-loved in those where it isn't.
Is it? What use does it serve? We don't have it for any other type
of list. We have zillions of types of page lists, and category pages
are the only ones that have the first letter displayed. It makes the
columns uneven, and is completely crazy for some scripts (like CJK,
AFAICT).
We have zillions of lists, but category pages are by far the most
visible and heavily used. That's why so much work has been done on
making them look nice, and why so many people are complaining about
category sorting instead of [[Special:DeadendPages]] sorting.
The CJK issue could be fixed by making the feature optional. The
uneven column issue is fixable using the multi-column layout feature
in CSS 3 and more recent versions of the major browsers.
-- Tim Starling