On Fri, Feb 22, 2008 at 5:59 AM, Samuel Wantman <wantman(a)earthlink.net> wrote:
While designing tables for categories, I hope that
some thought be given
into implementing category intersection. I suspect the process would be
much faster with properly designed tables.
Unfortunately, no. The way to do this effectively is probably just to
use a fulltext index. MySQL's fulltext system is not really
acceptable for Wikimedia use (too slow, only works on MyISAM), so what
MySQL tables exist is beside the point. What's needed is to write
something up that doesn't use MySQL, the way we use Lucene for search.
It could be done using Lucene, or Sphinx, or maybe with a
non-fulltext solution like PostgreSQL.
At the least, if there is a count stored with each
category, it would be possible to limit category intersections requests
to categories meeting predetermined size requirements.
Probably not useful or efficient enough to bother with.
There are some other upgrades for categories that
might also be
considered at the same time. One frequent request is that maintenance
categories not appear in pages with the other category listings. This
would require a flag in the category table that could be set somehow, or
perhaps with a new type of category name space (mcategory? wcategory?)
This could be a flag in the category table, yes ("do not display in
the page category list").
On Fri, Feb 22, 2008 at 9:36 AM, David Gerard <dgerard(a)gmail.com> wrote:
Note that if you build it they will come, i.e. you
will get the most
ridiculously contorted tag queries being run on a regular basis.
Hopefully it won't make the database crap itself if just OR and AND
are allowed. Or even just AND.
It's been previously concluded that AND doesn't work efficiently on
MySQL, last I heard.
Previous efforts in this direction have worked well
in testing on
Postgres, but foundered on MySQL not being up to the task. Since the
chance of Wikimedia moving off MySQL is about zero, a solution that
works on MySQL would be *wonderful*.
We don't have to move off MySQL, we just have to use a different
system for this one feature. That's perfectly plausible; we use
Lucene for search.
On Fri, Feb 22, 2008 at 10:02 AM, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:
I seem to get everything wrong when it comes to
databases, but I would
expect OR to be pretty easy. You just get the list of the first
category (which we do all the time), do it again for the next (equally
easy), and remove duplicates (I'm not sure how easy that would be, but
I would think at worst it's O(n.log(n)). It's AND that's difficult
because you have to compare everything in two categories (which is
probably O(n^2)). With OR, you can do one at a time.
Something like that. A UNION in MySQL is pretty fast. It would be
O(log N) or so, in the total number of categories/total number of
category links, same as any query that uses a B-tree. Obviously,
there would be at least a linear factor in the size of the result set
as well, but result sets can be kept small. You'd do something like
(SELECT cl_from, cl_sortkey FROM categorylinks WHERE
cl_to='Presidents_of_the_United_States' ORDER BY cl_sortkey LIMIT 20)
UNION
(SELECT cl_from, cl_sortkey FROM categorylinks WHERE
cl_to='United_States_Senators' ORDER BY cl_sortkey LIMIT 20)
ORDER BY cl_sortkey LIMIT 20
to get a list of the first 20 American presidents/senators. It's a
little worse than I had hoped, the toolserver shows a filesort on the
union:
+----+--------------+---------------+------+-------------------------+------------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys |
key | key_len | ref | rows | Extra |
+----+--------------+---------------+------+-------------------------+------------+---------+-------+------+--------------------------+
| 1 | PRIMARY | categorylinks | ref | cl_sortkey,cl_timestamp |
cl_sortkey | 257 | const | 81 | Using where; Using index |
| 2 | UNION | categorylinks | ref | cl_sortkey,cl_timestamp |
cl_sortkey | 257 | const | 68 | Using where; Using index |
| NULL | UNION RESULT | <union1,2> | ALL | NULL
| NULL | NULL | NULL | NULL | Using filesort |
+----+--------------+---------------+------+-------------------------+------------+---------+-------+------+--------------------------+
But it's not a huge deal, since you only need to sort 40 rows. It's
O(N log N) in the result set. I had hoped it would be able to do a
merge sort, which would be O(N) in the result set at least as far as
copying is concerned, AFAIK.
Then again, I don't claim to be a great expert on databases either, so
something above may be wrong.