Re: [Wikitech-l] Category tables and Category Intersection

22 Feb 2008


      On Fri, Feb 22, 2008 at 5:59 AM, Samuel Wantman wantman@earthlink.net wrote:
...
While designing tables for categories, I hope that some thought be given
 into implementing category intersection.  I suspect the process would be
 much faster with properly designed tables.
Unfortunately, no.  The way to do this effectively is probably just to
use a fulltext index.  MySQL's fulltext system is not really
acceptable for Wikimedia use (too slow, only works on MyISAM), so what
MySQL tables exist is beside the point.  What's needed is to write
something up that doesn't use MySQL, the way we use Lucene for search.
 It could be done using Lucene, or Sphinx, or maybe with a
non-fulltext solution like PostgreSQL.
...
At the least, if there is a count stored with each
 category, it would be possible to limit category intersections requests
 to categories meeting predetermined size requirements.
Probably not useful or efficient enough to bother with.
...
There are some other upgrades for categories that might also be
 considered at the same time.  One frequent request is that maintenance
 categories not appear in pages with the other category listings.  This
 would require a flag in the category table that could be set somehow, or
 perhaps with a new type of category name space (mcategory? wcategory?)
This could be a flag in the category table, yes ("do not display in
the page category list").
On Fri, Feb 22, 2008 at 9:36 AM, David Gerard dgerard@gmail.com wrote:
...
Note that if you build it they will come, i.e. you will get the most
 ridiculously contorted tag queries being run on a regular basis.
 Hopefully it won't make the database crap itself if just OR and AND
 are allowed. Or even just AND.
It's been previously concluded that AND doesn't work efficiently on
MySQL, last I heard.
...
Previous efforts in this direction have worked well in testing on
 Postgres, but foundered on MySQL not being up to the task. Since the
 chance of Wikimedia moving off MySQL is about zero, a solution that
 works on MySQL would be *wonderful*.
We don't have to move off MySQL, we just have to use a different
system for this one feature.  That's perfectly plausible; we use
Lucene for search.
On Fri, Feb 22, 2008 at 10:02 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
...
I seem to get everything wrong when it comes to databases, but I would
 expect OR to be pretty easy. You just get the list of the first
 category (which we do all the time), do it again for the next (equally
 easy), and remove duplicates (I'm not sure how easy that would be, but
 I would think at worst it's O(n.log(n)). It's AND that's difficult
 because you have to compare everything in two categories (which is
 probably O(n^2)). With OR, you can do one at a time.
Something like that.  A UNION in MySQL is pretty fast.  It would be
O(log N) or so, in the total number of categories/total number of
category links, same as any query that uses a B-tree.  Obviously,
there would be at least a linear factor in the size of the result set
as well, but result sets can be kept small.  You'd do something like
(SELECT cl_from, cl_sortkey FROM categorylinks WHERE
cl_to='Presidents_of_the_United_States' ORDER BY cl_sortkey LIMIT 20)
UNION
(SELECT cl_from, cl_sortkey FROM categorylinks WHERE
cl_to='United_States_Senators' ORDER BY cl_sortkey LIMIT 20)
ORDER BY cl_sortkey LIMIT 20
to get a list of the first 20 American presidents/senators.  It's a
little worse than I had hoped, the toolserver shows a filesort on the
union:
+----+--------------+---------------+------+-------------------------+------------+---------+-------+------+--------------------------+
| id | select_type  | table         | type | possible_keys           |
key        | key_len | ref   | rows | Extra                    |
+----+--------------+---------------+------+-------------------------+------------+---------+-------+------+--------------------------+
|  1 | PRIMARY      | categorylinks | ref  | cl_sortkey,cl_timestamp |
cl_sortkey | 257     | const |   81 | Using where; Using index |
|  2 | UNION        | categorylinks | ref  | cl_sortkey,cl_timestamp |
cl_sortkey | 257     | const |   68 | Using where; Using index |
| NULL | UNION RESULT | <union1,2>    | ALL  | NULL
| NULL       | NULL    | NULL  | NULL | Using filesort           |
+----+--------------+---------------+------+-------------------------+------------+---------+-------+------+--------------------------+
But it's not a huge deal, since you only need to sort 40 rows.  It's
O(N log N) in the result set.  I had hoped it would be able to do a
merge sort, which would be O(N) in the result set at least as far as
copying is concerned, AFAIK.
Then again, I don't claim to be a great expert on databases either, so
something above may be wrong.

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] Category tables and Category Intersection