Category tables and Category Intersection

List overview All Threads
Download

newer

older

Changes for Transwiki /...

problems with tex and the wikipdf...

Samuel Wantman

22 Feb 2008 22 Feb '08

10:59 a.m.

While designing tables for categories, I hope that some thought be given into implementing category intersection. I suspect the process would be much faster with properly designed tables. In previous discussions, the server load needed to intersect very large categories with little or no common members was enough to prevent the implementation of category intersection. At the least, if there is a count stored with each category, it would be possible to limit category intersections requests to categories meeting predetermined size requirements. There are some other upgrades for categories that might also be considered at the same time. One frequent request is that maintenance categories not appear in pages with the other category listings. This would require a flag in the category table that could be set somehow, or perhaps with a new type of category name space (mcategory? wcategory?) Samuel Wantman en:User:Sam

Show replies by date

David Gerard

22 Feb 22 Feb

2:36 p.m.

On 22/02/2008, Samuel Wantman <wantman(a)earthlink.net> wrote:

...

While designing tables for categories, I hope that some thought be given into implementing category intersection. I suspect the process would be much faster with properly designed tables.

OH GOD YES PLEASE say all of us on Commons. We want categories usable more like tags - where you could attach tens of broad tags, and subcategories would form from queries on intersections of them. This would also solve a lot of problems with ridiculously specific sub-sub-sub-categories on en:wp. Note that if you build it they will come, i.e. you will get the most ridiculously contorted tag queries being run on a regular basis. Hopefully it won't make the database crap itself if just OR and AND are allowed. Or even just AND. Previous efforts in this direction have worked well in testing on Postgres, but foundered on MySQL not being up to the task. Since the chance of Wikimedia moving off MySQL is about zero, a solution that works on MySQL would be *wonderful*. - d.

Thomas Dalton

3:02 p.m.

...

Note that if you build it they will come, i.e. you will get the most ridiculously contorted tag queries being run on a regular basis. Hopefully it won't make the database crap itself if just OR and AND are allowed. Or even just AND.

I seem to get everything wrong when it comes to databases, but I would expect OR to be pretty easy. You just get the list of the first category (which we do all the time), do it again for the next (equally easy), and remove duplicates (I'm not sure how easy that would be, but I would think at worst it's O(n.log(n)). It's AND that's difficult because you have to compare everything in two categories (which is probably O(n^2)). With OR, you can do one at a time.

David Gerard

3:09 p.m.

On 22/02/2008, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:

...

> Note that if you build it they will come, i.e. you will get the most > ridiculously contorted tag queries being run on a regular basis. > Hopefully it won't make the database crap itself if just OR and AND > are allowed. Or even just AND.

...

Most people would want AND, i.e. the intersection of two broad tags. - d.

Tim Starling

23 Feb 23 Feb

2:23 a.m.

David Gerard wrote:

...

Previous efforts in this direction have worked well in testing on Postgres, but foundered on MySQL not being up to the task. Since the chance of Wikimedia moving off MySQL is about zero, a solution that works on MySQL would be *wonderful*.

There's no need to move off MySQL in order to move on to PostgreSQL. We already have data stored in Lucene, Berkley DB and Memcached. Although it would be easier for system administration if we could use an existing installation of something, I certainly don't see it as a killer to a PostgreSQL solution. But it seems to me, if you look at data storage software already in use, Lucene is much better suited for computing intersections than MySQL. -- Tim Starling

Magnus Manske

25 Feb 25 Feb

1:52 p.m.

On Sat, Feb 23, 2008 at 2:23 AM, Tim Starling <tstarling(a)wikimedia.org> wrote:

...

David Gerard wrote:

If, at one point, we have the number of articles per category readily available, can't we just exclude large categories? The rest should work nicely with joins, subqueries, etc. It might even be fastest to get two full lists of category members (should cache well), then do the join in PHP. We'll lose the ability to do intersections with "living people", maintenance and license categories; but, compared to "no intersections" now, I could live with it for a while :-) Magnus

Simetrical

5:38 p.m.

On Mon, Feb 25, 2008 at 8:52 AM, Magnus Manske <magnusmanske(a)googlemail.com> wrote:

...

That removes most of the utility of this feature. The utility would be in getting things like a list of artists born in America in 1985, which would be an intersection of all artists, all people born in America, and all people born in 1985 -- all very large categories. And whatever you set the number at, the performance of this method is going to be bordering on unacceptable (by definition, since whatever you set the number at is ipso facto the highest acceptable value).

Simetrical

22 Feb 22 Feb

5:25 p.m.

On Fri, Feb 22, 2008 at 5:59 AM, Samuel Wantman <wantman(a)earthlink.net> wrote:

...

While designing tables for categories, I hope that some thought be given into implementing category intersection. I suspect the process would be much faster with properly designed tables.

Unfortunately, no. The way to do this effectively is probably just to use a fulltext index. MySQL's fulltext system is not really acceptable for Wikimedia use (too slow, only works on MyISAM), so what MySQL tables exist is beside the point. What's needed is to write something up that doesn't use MySQL, the way we use Lucene for search. It could be done using Lucene, or Sphinx, or maybe with a non-fulltext solution like PostgreSQL.

...

At the least, if there is a count stored with each category, it would be possible to limit category intersections requests to categories meeting predetermined size requirements.

Probably not useful or efficient enough to bother with.

...

There are some other upgrades for categories that might also be considered at the same time. One frequent request is that maintenance categories not appear in pages with the other category listings. This would require a flag in the category table that could be set somehow, or perhaps with a new type of category name space (mcategory? wcategory?)

This could be a flag in the category table, yes ("do not display in the page category list"). On Fri, Feb 22, 2008 at 9:36 AM, David Gerard <dgerard(a)gmail.com> wrote:

...

It's been previously concluded that AND doesn't work efficiently on MySQL, last I heard.

...

We don't have to move off MySQL, we just have to use a different system for this one feature. That's perfectly plausible; we use Lucene for search. On Fri, Feb 22, 2008 at 10:02 AM, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:

...

Something like that. A UNION in MySQL is pretty fast. It would be O(log N) or so, in the total number of categories/total number of category links, same as any query that uses a B-tree. Obviously, there would be at least a linear factor in the size of the result set as well, but result sets can be kept small. You'd do something like (SELECT cl_from, cl_sortkey FROM categorylinks WHERE cl_to='Presidents_of_the_United_States' ORDER BY cl_sortkey LIMIT 20) UNION (SELECT cl_from, cl_sortkey FROM categorylinks WHERE cl_to='United_States_Senators' ORDER BY cl_sortkey LIMIT 20) ORDER BY cl_sortkey LIMIT 20 to get a list of the first 20 American presidents/senators. It's a little worse than I had hoped, the toolserver shows a filesort on the union: +----+--------------+---------------+------+-------------------------+------------+---------+-------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------+---------------+------+-------------------------+------------+---------+-------+------+--------------------------+ | 1 | PRIMARY | categorylinks | ref | cl_sortkey,cl_timestamp | cl_sortkey | 257 | const | 81 | Using where; Using index | | 2 | UNION | categorylinks | ref | cl_sortkey,cl_timestamp | cl_sortkey | 257 | const | 68 | Using where; Using index | | NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | Using filesort | +----+--------------+---------------+------+-------------------------+------------+---------+-------+------+--------------------------+ But it's not a huge deal, since you only need to sort 40 rows. It's O(N log N) in the result set. I had hoped it would be able to do a merge sort, which would be O(N) in the result set at least as far as copying is concerned, AFAIK. Then again, I don't claim to be a great expert on databases either, so something above may be wrong.

Brion Vibber

8:14 p.m.

Simetrical wrote:

...

On Fri, Feb 22, 2008 at 5:59 AM, Samuel Wantman <wantman(a)earthlink.net> wrote:

While designing tables for categories, I hope that some thought be given into implementing category intersection. I suspect the process would be much faster with properly designed tables.

Unfortunately, no. The way to do this effectively is probably just to use a fulltext index. MySQL's fulltext system is not really acceptable for Wikimedia use (too slow, only works on MyISAM),

"Too slow" depends on actual usage. Unless you've benchmarked it in that usage, I wouldn't toss that word around yet. Being relegated to a MyISAM table isn't necessarily an issue, either. MyISAM has some issues, but there's nothing preventing us from using it when and where required. The main issues are: * May require error checking after a server crash (Annoying, but a) can be automated and b) doesn't have to interfere with everything else.) * Doesn't participate in transactions (Probably not that big a deal. Certainly no worse than creating yet-another external server.) * Reade and writes can hold conflicting locks, and you may have to wait on things sometimes. (Probably not that bad in this kind of case.) -- brion vibber (brion @ wikimedia.org)

Simetrical

9:09 p.m.

On Fri, Feb 22, 2008 at 3:14 PM, Brion Vibber <brion(a)wikimedia.org> wrote:

...

"Too slow" depends on actual usage. Unless you've benchmarked it in that usage, I wouldn't toss that word around yet.

Well, maybe I'm biased with its total failure to work reasonably on my own web server. :)

...

* Reade and writes can hold conflicting locks, and you may have to wait on things sometimes. (Probably not that bad in this kind of case.)

Isn't it? On my own server with MySQL fulltext (using vBulletin; I just switched to Sphinx to kill this problem) I would commonly see searches taking 20 or 30 seconds, or more, although most took much less. Possibly the searches or data sets were much worse cases than what we're discussing here, I don't know. What would happen to my forum, anyway, is that a routine update would get queued shortly after the select started, and that would in turn block all selects as well as all updates, for 20 or 30 seconds or more. Any pending update blocks selects, even if it it itself is blocked by a read lock, to avoid starvation of updates. This was on a table with only around 2.2 million rows, albeit much larger ones than category intersection would need (forum posts) -- so actually it could be comparable to the amount of data in a Wikipedia categories table, at one row per article, although it would have somewhat fewer rows. My recollection from Aerik's testing was that performance for MySQL fulltext was kind of marginal, didn't look like it would necessarily stand up under load. Looking back through the archives a bit, Jens seemed to agree a priori with my (more recent) assessment of table-level locking: https://lists.wikimedia.org/mailman/htdig/wikitech-l/2006-December/028002.h… Here's some of Aerik's testing, with a benchmark: https://lists.wikimedia.org/mailman/htdig/wikitech-l/2006-December/028081.h… Which gives a third of a second. Given one very rough estimate of five category changes per second (and that was some time ago), that would all but eliminate any concurrency in the selects: you'd have a couple of selects, then an update would get queued, everything would wait, a couple of updates would run, a few more selects, another update and everything waits again . . . And if the table crashed, you'd have to rebuild it, which could take an hour or a day. But hey, we have a working proof-of-concept. If anyone wants to try getting it running, I'm not objecting. My suspicion is that something reliable and non-blocking like Lucene or maybe PostgreSQL would be better, but I don't claim any great knowledge about databases.

5906

days inactive

5909

days old

wikitech-l@lists.wikimedia.org

Manage subscription

9 comments

7 participants

tags (0)

participants (7)

Brion Vibber
David Gerard
Magnus Manske
Samuel Wantman
Simetrical
Thomas Dalton
Tim Starling