On Wed, Feb 20, 2008 at 11:52 PM, Andrew Garrett andrew@epstone.net wrote:
Actually, I've been looking into this recently (it's a very frequent personal request). I very much like Simetrical's idea of having a separate category table.
I got it from Rob, I think.
Domas tells me that checking the number of items in a category is about a five-second query for very big categories.
I've observed this too, at http://en.wikipedia.org/wiki/Special:Categories?offset=Living_peopla . :)
So, if I were to implement this tomorrow, this is what I would do (feedback welcomed):
- Create a 'category' table starting off with a c_id, c_page_id, and c_count. This would fit in very nicely with something Tim was talking about last night - allowing flags to be put on categories (e.g. "don't show this category on the article", "show icon Image:X in the top corner of articles in this category", et cetera) which could be put in a separate column of the category table.
Not c_page_id. It needs to be c_title. Remember, the corresponding page may not exist. Also, why not have c_pages, c_images, c_subcategories separately? We do display them separately, and it just stores more info. Magnus' suggestion of separating out content namespace from others might be a good idea too.
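To make the suggestion concrete, here is a minimal sketch of such a table, using Python's sqlite3 purely for illustration. The column types, the UNIQUE constraint, and the helper names make_db and category_counts are my assumptions; only c_id, c_title, c_pages, c_images, and c_subcategories come from the discussion above.

```python
import sqlite3

# Hypothetical schema sketch; types and constraints are assumptions.
SCHEMA = """
CREATE TABLE category (
    c_id            INTEGER PRIMARY KEY,
    c_title         TEXT    NOT NULL UNIQUE,  -- title, not page ID: the page may not exist
    c_pages         INTEGER NOT NULL DEFAULT 0,
    c_images        INTEGER NOT NULL DEFAULT 0,
    c_subcategories INTEGER NOT NULL DEFAULT 0
)
"""

def make_db():
    """Create an in-memory database holding the sketched category table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn

def category_counts(conn, title):
    """Return (pages, images, subcategories) for a title, or None if no row exists."""
    return conn.execute(
        "SELECT c_pages, c_images, c_subcategories FROM category WHERE c_title = ?",
        (title,),
    ).fetchone()
```

Keying on c_title means a category that has members but no description page still gets a row.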
- On save of a category, or a page including that category, insert the requisite category entry (this would save a dirty great big migration script).
- When a categorylinks item is INSERTED or DELETED on links-update (the code is nice in that it only inserts/deletes those items which have been added/removed), do the requisite incrementing/decrementing on the category table.
I guess what you would do is UPDATE category SET c_count = c_count+1, and then check the number of affected rows to see if it worked. If it didn't, you can regenerate the whole thing with an INSERT ... SELECT. And on category view, you can try reading from the category table, and if that fails, read from the result set as usual. This seems sane: it has no additional overhead (above a migration script) once the migration is complete.
On the other hand, it means we have to keep these obsolescent bits of code around and maintain them forever. It might be best to have the default update.php do a one-shot migration script; and on Wikimedia, push the category table updating code (in links-update) live first, then start running the migration gradually in the background. The category table updating code wouldn't have to check whether it actually worked or not; it would just affect zero rows if it didn't. Once Wikimedia was updated, of course, we would add features that depended on the count being accurate.
On Thu, Feb 21, 2008 at 2:42 AM, Jim Hu jimhu@tamu.edu wrote:
That would be nice. I haven't kept up with the last couple of versions, but there used to also be a problem with subcategories not showing up if they didn't sort into the initial 200 in the query limit. If this is going to get fixed, it would be nice if there were a fix for that too, if it's not already addressed... we're still on 1.9x
This is a separate issue (which is still quite true). It's because we don't page subcategories, articles, and images separately, although we display them separately. We retrieve the first 200 category members by sort key, and only then do we pigeonhole them into the correct part of the page.
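A toy model of that behavior, in plain Python rather than the actual MediaWiki code, shows why a late-sorting subcategory disappears from the first page. The tuple layout and the name first_page are made up for illustration.

```python
from collections import defaultdict

PAGE_LIMIT = 200  # the shared limit mentioned above

def first_page(members, limit=PAGE_LIMIT):
    """members: iterable of (sort_key, kind, title); kind is 'subcat', 'page', or 'file'.

    Returns a dict mapping kind -> list of titles shown on the first page.
    """
    page = sorted(members)[:limit]       # one limit applied across all kinds
    sections = defaultdict(list)
    for _sort_key, kind, title in page:  # pigeonholing happens only after truncation
        sections[kind].append(title)
    return dict(sections)
```

With 250 articles sorting under "A" and one subcategory sorting under "Z", the result has 200 articles and no subcategory section at all.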
There are various methods that have been discussed to deal with this. We could, for instance, include a null or other low-sorting ASCIIbetical character in the sort keys of subcategories. That would be the quick and ugly way, and would work okay, but would still possibly not work as expected -- subcategories would still be paged with everything else, just always on the first page, and not necessarily even that if someone used some really weird sort keys.
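Here is a small sketch of the quick-and-ugly variant and its failure mode. It assumes a NUL character as the low-sorting prefix and uses Python's code-point string ordering as a stand-in for whatever collation the database actually applies; effective_sort_key and ordered_titles are hypothetical names.

```python
SUBCAT_PREFIX = "\x00"  # assumed low-sorting marker

def effective_sort_key(sort_key, is_subcat):
    """Prefix subcategory sort keys so they sort ahead of ordinary members."""
    return SUBCAT_PREFIX + sort_key if is_subcat else sort_key

def ordered_titles(members):
    """members: list of (sort_key, is_subcat, title); returns titles in paging order."""
    keyed = sorted((effective_sort_key(k, s), t) for k, s, t in members)
    return [title for _key, title in keyed]
```

Subcategories now come first, but an article whose own sort key starts with two NUL characters still sorts ahead of them, which is the "really weird sort keys" case.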
Arguably, a better way would be to have a separate subcategorylinks table, keep category articles' category inclusions out of the categorylinks table, and page them totally separately. This is harder and also fragments very similar info into separate tables. A similar solution would be to add a one-bit field (CHAR(0) NULL?! :D) to categorylinks, index it, and use that to indicate whether cl_from is a category or not. This bloats indexes a bit.
Some variant of one of these latter two might also be used for uploaded files, if we want to page those separately too. If we paged subcategories, articles, and files separately, then paging would match the layout of the page: none of the three sections of the page "interferes" with any other. This might be the most intuitive. It would also make sense to have somewhat smaller numbers of images displayed per page, if they're thumbnailed, which is impossible if they're paged together with articles.