On Wed, Feb 20, 2008 at 11:52 PM, Andrew Garrett <andrew(a)epstone.net> wrote:
> Actually, I've been looking into this recently (It's a very frequent
> personal request). I very much like Simetrical's idea of having a
> separate category table
I got it from Rob, I think.
> Domas tells me that checking the number of items in a category is
> about a five-second query for very big categories
I've observed this too, at
http://en.wikipedia.org/wiki/Special:Categories?offset=Living_peopla .
:)
> So, if I were to implement this tomorrow, this is what I would do
> (feedback welcomed):
> * Create a 'category' table starting off with a c_id, c_page_id, and
> c_count. This would fit in very nicely with something Tim was talking
> about last night - allowing flags to be put on categories (i.e. "don't
> show this category on the article", "show icon Image:X in the top
> corner of articles in this category", et cetera) which could be put in
> a separate column of the category table.
Not c_page_id. It needs to be c_title. Remember, the corresponding
page may not exist. Also, why not have c_pages, c_images,
c_subcategories separately? We do display them separately, and it
just stores more info. Magnus' suggestion of separating out content
namespace from others might be a good idea too.
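
A minimal sketch of what separate counters keyed by title might look like, using Python's sqlite3 as a stand-in for the real database. The column names c_pages, c_images, and c_subcategories come from the suggestion above; the helper names counter_column and count_member are hypothetical, and the namespace numbers are MediaWiki's usual values for images (6) and categories (14).

```python
import sqlite3

NS_IMAGE, NS_CATEGORY = 6, 14  # MediaWiki namespace numbers

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE category (
        c_title         TEXT PRIMARY KEY,  -- keyed by title: the page may not exist
        c_pages         INTEGER NOT NULL DEFAULT 0,
        c_images        INTEGER NOT NULL DEFAULT 0,
        c_subcategories INTEGER NOT NULL DEFAULT 0
    )
""")

def counter_column(namespace):
    """Pick the counter that a member in this namespace belongs to."""
    if namespace == NS_CATEGORY:
        return "c_subcategories"
    if namespace == NS_IMAGE:
        return "c_images"
    return "c_pages"

def count_member(conn, title, namespace, delta=1):
    """Hypothetical helper: adjust the right counter for one added or removed member."""
    column = counter_column(namespace)  # always one of three fixed names
    conn.execute("INSERT OR IGNORE INTO category (c_title) VALUES (?)", (title,))
    conn.execute(
        f"UPDATE category SET {column} = {column} + ? WHERE c_title = ?",
        (delta, title),
    )

count_member(conn, "Animals", 0)            # an article
count_member(conn, "Animals", 0)            # another article
count_member(conn, "Animals", NS_CATEGORY)  # a subcategory
counts = conn.execute(
    "SELECT c_pages, c_images, c_subcategories FROM category WHERE c_title = 'Animals'"
).fetchone()
print(counts)
```

Note that the row for "Animals" is created without any reference to a page ID, which is the point of using c_title.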
> * On save of a category, or a page including that category, insert the
> requisite category entry (this would save a dirty great big migration
> script).
> * When a categorylinks item is INSERTED or DELETED on links-update
> (the code is nice in that it only inserts/deletes those items which
> have been added/removed), do the requisite incrementing/decrementing
> on the category table.
I guess what you would do is UPDATE category SET c_count = c_count+1
for the affected category, and then check the number of affected rows
to see if it worked. If no rows were affected, you can regenerate the
category's row with an INSERT ... SELECT.
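
Roughly, the update-then-fall-back idea could look like the following sketch. It uses Python's sqlite3 in place of the real database layer; the function names bump_count and add_link are hypothetical, and the category table is assumed to be keyed by c_title as suggested earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    CREATE TABLE category (c_title TEXT PRIMARY KEY, c_count INTEGER NOT NULL);
""")

def bump_count(conn, title, delta):
    """Adjust c_count by delta; if the row is missing, rebuild it from categorylinks."""
    cur = conn.execute(
        "UPDATE category SET c_count = c_count + ? WHERE c_title = ?",
        (delta, title),
    )
    if cur.rowcount == 0:
        # Nothing was updated, so the category has no row yet:
        # regenerate the count in one INSERT ... SELECT.
        conn.execute(
            "INSERT INTO category (c_title, c_count) "
            "SELECT ?, COUNT(*) FROM categorylinks WHERE cl_to = ?",
            (title, title),
        )

def add_link(conn, page_id, title):
    """Hypothetical links-update step: record the link, then fix the counter."""
    conn.execute("INSERT INTO categorylinks VALUES (?, ?)", (page_id, title))
    bump_count(conn, title, 1)

add_link(conn, 1, "Living_people")  # no row yet: falls back to INSERT ... SELECT
add_link(conn, 2, "Living_people")  # row exists: plain increment
count = conn.execute(
    "SELECT c_count FROM category WHERE c_title = 'Living_people'"
).fetchone()[0]
print(count)
```

Because the link is written before the counter is touched, the regenerated count already includes the new member.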
And on category view, you can try reading from the category table, and
if that fails, read from the result set as usual. This seems sane: it
has no additional overhead (above a migration script) once the
migration is complete.
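
As a sketch of that read path (again with sqlite3 as a stand-in, and with category_size as a hypothetical name), the view code would prefer the stored count and only count categorylinks rows when no category row exists yet:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    CREATE TABLE category (c_title TEXT PRIMARY KEY, c_count INTEGER NOT NULL);
    INSERT INTO categorylinks VALUES (1, 'Migrated'), (2, 'Migrated'), (3, 'Unmigrated');
    INSERT INTO category VALUES ('Migrated', 2);
""")

def category_size(conn, title):
    """Use the stored count when there is one; otherwise count the links directly."""
    row = conn.execute(
        "SELECT c_count FROM category WHERE c_title = ?", (title,)
    ).fetchone()
    if row is not None:
        return row[0]
    # Fallback for categories that have not been migrated yet (the slow path).
    return conn.execute(
        "SELECT COUNT(*) FROM categorylinks WHERE cl_to = ?", (title,)
    ).fetchone()[0]

migrated = category_size(conn, "Migrated")
unmigrated = category_size(conn, "Unmigrated")
print(migrated, unmigrated)
```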
On the other hand, it means we have to keep and maintain these
obsolescent bits of code forever. It might be best to have the
default update.php run a one-shot migration script; and on Wikimedia,
push the category table updating code (in links-update) live first,
then start running the migration gradually in the background. The
category table updating code wouldn't have to check whether it
actually worked or not, it would just affect zero rows if it didn't.
Once Wikimedia was updated, of course, we would add features that
depended on the count being accurate.
On Thu, Feb 21, 2008 at 2:42 AM, Jim Hu <jimhu(a)tamu.edu> wrote:
> That would be nice. I haven't kept up with the last couple of
> versions, but there used to also be a problem with subcategories not
> showing up if they didn't sort into the initial 200 in the query
> limit. If this is going to get fixed, it would be nice if there was a
> fix for that too, if it's not already addressed...we're still on 1.9x
This is a separate issue (which is still quite true). It's because we
don't page subcategories, articles, and images separately, although we
display them separately. We retrieve the first 200 category members
by sort key, and only then do we pigeonhole them into the correct part
of the page.
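
To make the failure concrete, here is a small pure-Python model of that behaviour. It is only an illustration, not the actual category page code; the function name first_page and the sample data are made up.

```python
LIMIT = 200

def first_page(members, limit=LIMIT):
    """members: (sort_key, kind) pairs, where kind is 'subcat', 'page', or 'file'."""
    # Step 1: take the first `limit` members by sort key, ignoring their kind.
    page = sorted(members)[:limit]
    # Step 2: only now sort them into the three sections of the page.
    sections = {"subcat": [], "page": [], "file": []}
    for sort_key, kind in page:
        sections[kind].append(sort_key)
    return sections

# 250 articles that sort early, plus one subcategory that sorts late.
members = [(f"A{i:03d}", "page") for i in range(250)] + [("Zebras", "subcat")]
sections = first_page(members)
print(len(sections["page"]), sections["subcat"])
```

The subcategory section comes back empty on the first page even though the category has a subcategory, because the limit was applied before the members were split up.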
There are various methods that have been discussed to deal with this.
We could, for instance, include a null or other low-sorting
ASCIIbetical character in the sort keys of subcategories. That would
be the quick and ugly way, and would work okay, but would still
possibly not work as expected -- subcategories would still be paged
with everything else, just always on the first page, and not
necessarily even that if someone used some really weird sort keys.
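
A sketch of the quick-and-ugly variant, in the same toy model: subcategory sort keys get a NUL prefix so they sort ahead of ordinary keys. The helper name effective_sort_key is hypothetical.

```python
LIMIT = 200

def effective_sort_key(sort_key, is_subcat):
    """Prefix subcategory sort keys with NUL so they sort before ordinary keys."""
    return ("\x00" + sort_key) if is_subcat else sort_key

# 250 articles that sort early, plus one subcategory that would normally sort last.
members = [(f"A{i:03d}", False) for i in range(250)] + [("Zebras", True)]

keyed = sorted((effective_sort_key(key, sub), key, sub) for key, sub in members)
first_page = keyed[:LIMIT]
subcats_on_first_page = [key for _, key, sub in first_page if sub]
print(subcats_on_first_page)
```

The subcategory now lands on the first page, but it still takes up one of the 200 slots, and an article whose own sort key begins with a NUL or similar character would still interleave with the subcategories.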
Arguably, a better way would be to have a separate subcategorylinks
table, keep category articles' category inclusions out of the
categorylinks table, and page them totally separately. This is harder
and also fragments very similar info into separate tables. A similar
solution would be to add a one-bit field (CHAR(0) NULL?! :D) to
categorylinks, index it, and use that to indicate whether cl_from is a
category or not. This bloats indexes a bit.
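
A sketch of the flag-column variant, using sqlite3 as a stand-in. The column name cl_is_subcat, the index name, and the function page_of are all hypothetical, and a plain integer is used for the flag rather than anything as clever as CHAR(0) NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categorylinks (
        cl_from      INTEGER,
        cl_to        TEXT,
        cl_sortkey   TEXT,
        cl_is_subcat INTEGER NOT NULL DEFAULT 0  -- hypothetical flag column
    );
    -- The flag sits in the index ahead of the sort key: this is the index bloat.
    CREATE INDEX cl_flag_sortkey ON categorylinks (cl_to, cl_is_subcat, cl_sortkey);
""")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, 'Animals', ?, 0)",
    [(i, f"A{i:03d}") for i in range(250)],
)
conn.execute("INSERT INTO categorylinks VALUES (999, 'Animals', 'Zebras', 1)")

def page_of(conn, category, is_subcat, after="", limit=200):
    """Page one section on its own: only rows with the given flag, ordered by sort key."""
    rows = conn.execute(
        "SELECT cl_sortkey FROM categorylinks "
        "WHERE cl_to = ? AND cl_is_subcat = ? AND cl_sortkey > ? "
        "ORDER BY cl_sortkey LIMIT ?",
        (category, int(is_subcat), after, limit),
    )
    return [row[0] for row in rows]

subcats = page_of(conn, "Animals", True)
pages = page_of(conn, "Animals", False)
next_pages = page_of(conn, "Animals", False, after=pages[-1])
print(subcats, len(pages), len(next_pages))
```

Since each section has its own query, each can also take its own limit, for example a smaller one for a section that shows thumbnails.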
Some variant of one of these latter two might also be used for
uploaded files, if we want to page those separately too. If we paged
subcategories, articles, and files separately, then paging would match
the layout of the page: none of the three sections of the page
"interferes" with any other. This might be the most intuitive. It
would also make sense to have somewhat smaller numbers of images
displayed per page, if they're thumbnailed, which is impossible if
they're paged together with articles.