On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling <tstarling(a)wikimedia.org> wrote:
An alternative would be to add a column to the categorylinks table,
say cl_type. It could be an ENUM or some short text type. Then the
index could be altered to include this field at the start of it.
Presumably the rationale for combining these two things into
cl_sortkey is to avoid a schema change, and to make the paging code
slightly simpler. But I worry that future generations of MediaWiki
developers will curse us for laziness and obfuscation.
One problem with this, though, that just occurred to me: won't it mean
that any existing users of cl_sortkey will suddenly not be using an
index? I see a whole bunch of extensions that do ORDER BY cl_sortkey.
Hacking up cl_sortkey to store weird stuff in a prefix will make them
sort things oddly, but they won't filesort large categories. I assume
we don't want to keep an extra index just for these marginal uses.
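To make the index concern concrete, here is a minimal sketch using
Python's sqlite3 module as a stand-in for MySQL. The index name cl_demo
and the category name are made up for illustration, and SQLite's
"USE TEMP B-TREE FOR ORDER BY" plan step is only an analogue of a MySQL
filesort, so treat this as a demonstration of the principle rather than
of MySQL's actual planner output.

```python
import sqlite3


def plan_for(index_columns):
    """Return the query-plan steps for an ORDER BY cl_sortkey query
    against a categorylinks table that has one index on index_columns."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE categorylinks ("
        " cl_from INTEGER, cl_to TEXT, cl_type TEXT, cl_sortkey TEXT)"
    )
    # Hypothetical index name; the column list is what matters here.
    db.execute(f"CREATE INDEX cl_demo ON categorylinks ({index_columns})")
    rows = db.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT cl_from FROM categorylinks WHERE cl_to = ? ORDER BY cl_sortkey",
        ("Physics",),
    ).fetchall()
    db.close()
    # The last column of each plan row is the human-readable description.
    return [row[-1] for row in rows]


# Index shaped like the existing one: rows for a category come out sorted.
old_plan = plan_for("cl_to, cl_sortkey")
# Index with cl_type inserted before cl_sortkey: an extra sort is needed
# unless the query also constrains cl_type.
new_plan = plan_for("cl_to, cl_type, cl_sortkey")

for label, plan in (("old index", old_plan), ("new index", new_plan)):
    print(label, "->", plan)
```

A query that also pins cl_type (WHERE cl_to = ? AND cl_type = ?) could
read the new index in sorted order again, but that requires changing
every caller, which is the problem with the existing extensions.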
On Thu, Jul 22, 2010 at 11:02 PM, Tim Starling <tstarling(a)wikimedia.org> wrote:
No objection. That's mostly how it is already, except when the file
doesn't exist but the description page does.
Okay, done in r69802.
Welsh is not really what I was thinking when I said get statistics.
Well, it's a lot faster, particularly on the toolserver. Wikimedia
database servers don't have to cope with ludicrously stupid queries
being run all the time. :)
So we're looking at 17 bytes per row for raw text, and 8 bytes per row
for compacted text, plus 1 byte per row for the length byte. Overall,
assuming the lengths are the same across all namespaces, it would be
approximately 680 MB in the raw form for the English Wikipedia, and
presumably several times that for all wikis. Our English Wikipedia
core DB servers have between 700 GB and 2 TB of storage space, with
~450 GB currently in use. So the impact of adding an extra 1 GB or so
would be minimal.
So within the range of my previous estimate in relative terms, 50% to
75% space saved.
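As a quick check on the relative saving, the per-row figures quoted
above can be plugged in directly. This is only arithmetic on the numbers
in this thread (17, 8 and 1 bytes); it assumes nothing about the actual
row count.

```python
RAW_BYTES = 17      # average raw sortkey text per row (figure from this thread)
COMPACT_BYTES = 8   # average compacted text per row (figure from this thread)
LENGTH_BYTE = 1     # per-row length byte (figure from this thread)


def saving(raw, compact):
    """Fraction of space saved by storing `compact` bytes instead of `raw`."""
    return 1 - compact / raw


text_only = saving(RAW_BYTES, COMPACT_BYTES)
with_length = saving(RAW_BYTES + LENGTH_BYTE, COMPACT_BYTES + LENGTH_BYTE)

print(f"text only: {text_only:.1%}, including length byte: {with_length:.1%}")
# prints "text only: 52.9%, including length byte: 50.0%"
```

Both figures sit at the low end of the 50% to 75% range.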
Truncating after the first letter would only save about 260 MB for the
entire English Wikipedia. And it would limit the applications. For
instance, it would prevent fast updates of the collation algorithm.
Instead we would have to reparse the pages. That could take weeks,
even with a dozen servers dedicated to the task.
Okay, I'll store the raw versions.
If we're going to have multiple locale-specific collation algorithms
(and that seems likely), then it may make sense to add a collation ID
foreign key to the categorylinks table, to track updates. Sensible
sorting behaviour mid-way through an update is probably not feasible,
but we can at least make it possible to track the problem.
One way to have sensible sorting behavior midway through collation
(suggested by Philippe Verdy on bug 164, if I understood right, but at
least that's the inspiration) would be to have a cl_collation to track
this, and then extend the unique index on (cl_from, cl_to) to
(cl_from, cl_to, cl_collation), adjust all other indexes similarly
where necessary, and add WHERE cl_collation = 73 or whatever to all
the queries. Then when switching collations, we could have the code
start keeping them updated in parallel; do a batch job to add extra
rows for the new collation where they don't exist; and when that's
done, stop maintaining the old collation and DELETE it. Do you think
that's a good idea? How often do we expect to have to change the
collation?
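The parallel-collation switch described above could look roughly like
the following sketch. It uses Python's sqlite3 as a stand-in for MySQL;
the collation IDs 72 and 73, the two collation functions, the trimmed
column set and the helper names are all hypothetical, chosen only to
show the sequence: query by cl_collation, backfill rows for the new
collation, switch, then DELETE the old rows.

```python
import sqlite3

OLD, NEW = 72, 73  # hypothetical collation IDs


def collate_old(raw):
    return raw          # placeholder: plain binary ordering


def collate_new(raw):
    return raw.upper()  # placeholder: case-insensitive ordering


db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE categorylinks (
        cl_from        INTEGER NOT NULL,
        cl_to          TEXT    NOT NULL,
        cl_collation   INTEGER NOT NULL,
        cl_raw_sortkey TEXT    NOT NULL,
        cl_sortkey     TEXT    NOT NULL,
        UNIQUE (cl_from, cl_to, cl_collation)
    )
    """
)
for page_id, raw in ((1, "apple"), (2, "Banana"), (3, "cherry")):
    db.execute(
        "INSERT INTO categorylinks VALUES (?, ?, ?, ?, ?)",
        (page_id, "Fruit", OLD, raw, collate_old(raw)),
    )


def members(category, collation):
    """Page IDs in a category, ordered under one specific collation."""
    rows = db.execute(
        "SELECT cl_from FROM categorylinks "
        "WHERE cl_to = ? AND cl_collation = ? ORDER BY cl_sortkey",
        (category, collation),
    )
    return [row[0] for row in rows]


def backfill(old, new, collate):
    """Batch job: add rows for `new` wherever only an `old` row exists."""
    missing = db.execute(
        "SELECT o.cl_from, o.cl_to, o.cl_raw_sortkey FROM categorylinks o "
        "WHERE o.cl_collation = ? AND NOT EXISTS ("
        "  SELECT 1 FROM categorylinks n WHERE n.cl_from = o.cl_from"
        "  AND n.cl_to = o.cl_to AND n.cl_collation = ?)",
        (old, new),
    ).fetchall()
    db.executemany(
        "INSERT INTO categorylinks VALUES (?, ?, ?, ?, ?)",
        [(f, t, new, raw, collate(raw)) for f, t, raw in missing],
    )
    return len(missing)


before = members("Fruit", OLD)             # old ordering stays usable
added = backfill(OLD, NEW, collate_new)    # safe to re-run; skips done rows
after = members("Fruit", NEW)              # readers switch once backfill is done
db.execute("DELETE FROM categorylinks WHERE cl_collation = ?", (OLD,))
remaining = db.execute("SELECT COUNT(*) FROM categorylinks").fetchone()[0]
```

The cost is that the table temporarily holds two rows per link, so the
table and its indexes roughly double in size for the duration of the
switch.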
We have zillions of lists, but category pages are by far the most
visible and heavily used. That's why so much work has been done on
making them look nice, and why so many people are complaining about
category sorting instead of [[Special:DeadendPages]] sorting.
Okay, so I'll make sure that the infrastructure is present for this to
work, including a way for languages that don't want it to turn it off.
The uneven column issue is fixable using the multi-column layout
feature in CSS 3 and more recent versions of the major browsers.
We'd have to scratch the "C cont." labels at the top of each column,
though. Is that too much of an imposition on the page's prettiness?
:)
On Fri, Jul 23, 2010 at 12:40 AM, Krinkle <krinklemail(a)gmail.com> wrote:
Something that comes to mind is Wikimedia Commons. Would the same
category with a different uselang be sorted differently?
No. That's not practical at all. It can only be per-category. (I
don't plan to implement per-category sort orders in this
implementation, unless I'm asked to.)
No objection either. However, make sure that the other way around is
blocked too. Else one might accidentally move a page to the
File-namespace without being able to get it back.
Of course.
Okay, so I think I have enough feedback to write up an initial
implementation. I'll start on that now. To begin with I'll probably
just stick in a dummy collation function, such as "convert everything
to uppercase", and test it with $wgCapitalLinks set to false. Outline
of my initial implementation plans:
* Add $wgExperimentalCategorySort, default to false, for testing, so
that I can commit this to trunk (I don't like using branches).
* Add unindexed varchar(255) NULL cl_raw_sortkey column, to serve the
same function as the current cl_sortkey, except it will be NULL if the
raw sortkey is identical to the one retrieved from the name.
* Add cl_collation tinyint column to track collation revisions, with
an index. Add a $wgCollationVersion variable, initially set to 0 (for
no collation).
* Add cl_type column, an ENUM of ('page', 'subcat', 'file'). ENUMs
are somewhat evil to change in MySQL, but this isn't likely to change.
* Change existing indexes appropriately.
* Write code to keep the new columns populated and use them on
category page display.
* Write a script to populate the new columns.
* Write some trivial collation function and try to migrate to it.
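The plan above can be sketched end to end as follows. This is an
illustration, not the eventual MediaWiki code: it uses Python's sqlite3
in place of MySQL, so the ENUM becomes a CHECK constraint and tinyint
becomes INTEGER. The index names, the simplified page table, the sample
rows and the helpers populate() and link_type() are assumptions;
COLLATION_VERSION stands in for $wgCollationVersion.

```python
import sqlite3

COLLATION_VERSION = 1  # stands in for $wgCollationVersion (0 = no collation)


def uppercase_collation(raw):
    """The dummy collation: convert everything to uppercase."""
    return raw.upper()


def link_type(namespace):
    # MediaWiki namespace numbers: 14 is Category, 6 is File.
    return {14: "subcat", 6: "file"}.get(namespace, "page")


db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT
    );
    CREATE TABLE categorylinks (
        cl_from        INTEGER NOT NULL,
        cl_to          TEXT    NOT NULL,
        cl_sortkey     TEXT    NOT NULL,
        cl_raw_sortkey TEXT,                        -- NULL if same as the title
        cl_collation   INTEGER NOT NULL DEFAULT 0,  -- tinyint in MySQL
        cl_type        TEXT    NOT NULL DEFAULT 'page'
            CHECK (cl_type IN ('page', 'subcat', 'file')),  -- ENUM stand-in
        UNIQUE (cl_from, cl_to)
    );
    CREATE INDEX cl_sortkey_idx ON categorylinks (cl_to, cl_type, cl_sortkey, cl_from);
    CREATE INDEX cl_collation_idx ON categorylinks (cl_collation);
    """
)
db.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [(1, 0, "apple"), (2, 14, "Berries"), (3, 6, "cherry.jpg"), (4, 0, "Zebra fruit")],
)
# Legacy rows: cl_sortkey holds either the title or a user-supplied key.
db.executemany(
    "INSERT INTO categorylinks (cl_from, cl_to, cl_sortkey) VALUES (?, ?, ?)",
    [(1, "Fruit", "apple"), (2, "Fruit", "Berries"),
     (3, "Fruit", "cherry.jpg"), (4, "Fruit", "date")],
)


def populate():
    """Fill the new columns for every row not yet at COLLATION_VERSION."""
    rows = db.execute(
        "SELECT cl_from, cl_to, cl_sortkey, page_namespace, page_title "
        "FROM categorylinks JOIN page ON page_id = cl_from "
        "WHERE cl_collation <> ?",
        (COLLATION_VERSION,),
    ).fetchall()
    for cl_from, cl_to, legacy, namespace, title in rows:
        raw = None if legacy == title else legacy
        db.execute(
            "UPDATE categorylinks SET cl_raw_sortkey = ?, cl_sortkey = ?, "
            "cl_collation = ?, cl_type = ? WHERE cl_from = ? AND cl_to = ?",
            (raw, uppercase_collation(legacy), COLLATION_VERSION,
             link_type(namespace), cl_from, cl_to),
        )
    return len(rows)


def listing(category, cl_type):
    """Page IDs of one type in a category, in collated order."""
    rows = db.execute(
        "SELECT cl_from FROM categorylinks "
        "WHERE cl_to = ? AND cl_type = ? ORDER BY cl_sortkey",
        (category, cl_type),
    )
    return [row[0] for row in rows]


updated = populate()
```

Calling listing("Fruit", "page"), listing("Fruit", "subcat") and
listing("Fruit", "file") then gives the three sections of the category
page, each of which can be paged through the index separately.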
I'm aiming to be done with this on Monday. There are still some
outstanding issues I'm not sure about that I've outlined here, but
they won't be much work to change later.