On Fri, Feb 22, 2008 at 5:59 AM, Samuel Wantman wantman@earthlink.net wrote:
While designing tables for categories, I hope that some thought be given into implementing category intersection. I suspect the process would be much faster with properly designed tables.
Unfortunately, no. The way to do this effectively is probably just to use a fulltext index. MySQL's fulltext system is not really acceptable for Wikimedia use (too slow, only works on MyISAM), so what MySQL tables exist is beside the point. What's needed is to write something up that doesn't use MySQL, the way we use Lucene for search. It could be done using Lucene, or Sphinx, or maybe with a non-fulltext solution like PostgreSQL.
At the least, if there is a count stored with each category, it would be possible to limit category intersections requests to categories meeting predetermined size requirements.
Probably not useful or efficient enough to bother with.
There are some other upgrades for categories that might also be considered at the same time. One frequent request is that maintenance categories not appear in pages with the other category listings. This would require a flag in the category table that could be set somehow, or perhaps with a new type of category name space (mcategory? wcategory?)
This could be a flag in the category table, yes ("do not display in the page category list").
On Fri, Feb 22, 2008 at 9:36 AM, David Gerard dgerard@gmail.com wrote:
Note that if you build it they will come, i.e. you will get the most ridiculously contorted tag queries being run on a regular basis. Hopefully it won't make the database crap itself if just OR and AND are allowed. Or even just AND.
It's been previously concluded that AND doesn't work efficiently on MySQL, last I heard.
Previous efforts in this direction have worked well in testing on Postgres, but foundered on MySQL not being up to the task. Since the chance of Wikimedia moving off MySQL is about zero, a solution that works on MySQL would be *wonderful*.
We don't have to move off MySQL, we just have to use a different system for this one feature. That's perfectly plausible; we use Lucene for search.
On Fri, Feb 22, 2008 at 10:02 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
I seem to get everything wrong when it comes to databases, but I would expect OR to be pretty easy. You just get the list of the first category (which we do all the time), do it again for the next (equally easy), and remove duplicates (I'm not sure how easy that would be, but I would think at worst it's O(n.log(n)). It's AND that's difficult because you have to compare everything in two categories (which is probably O(n^2)). With OR, you can do one at a time.
Something like that. A UNION in MySQL is pretty fast. It would be O(log N) or so, in the total number of categories/total number of category links, same as any query that uses a B-tree. Obviously, there would be at least a linear factor in the size of the result set as well, but result sets can be kept small. You'd do something like
(SELECT cl_from, cl_sortkey FROM categorylinks WHERE cl_to='Presidents_of_the_United_States' ORDER BY cl_sortkey LIMIT 20) UNION (SELECT cl_from, cl_sortkey FROM categorylinks WHERE cl_to='United_States_Senators' ORDER BY cl_sortkey LIMIT 20) ORDER BY cl_sortkey LIMIT 20
to get a list of the first 20 American presidents/senators. It's a little worse than I had hoped, the toolserver shows a filesort on the union:
+----+--------------+---------------+------+-------------------------+------------+---------+-------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------+---------------+------+-------------------------+------------+---------+-------+------+--------------------------+ | 1 | PRIMARY | categorylinks | ref | cl_sortkey,cl_timestamp | cl_sortkey | 257 | const | 81 | Using where; Using index | | 2 | UNION | categorylinks | ref | cl_sortkey,cl_timestamp | cl_sortkey | 257 | const | 68 | Using where; Using index | | NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | Using filesort | +----+--------------+---------------+------+-------------------------+------------+---------+-------+------+--------------------------+
But it's not a huge deal, since you only need to sort 40 rows. It's O(N log N) in the result set. I had hoped it would be able to do a merge sort, which would be O(N) in the result set at least as far as copying is concerned, AFAIK.
Then again, I don't claim to be a great expert on databases either, so something above may be wrong.