Okay, I've got some better numbers (and a better method of building the category intersections table from an existing database).
I built a table with 2.6 million records, all the pages from the snapshot I had. It's 90% populated with categories (this will make more sense in a minute), and I'm still getting good results from intersection queries. For instance:
Showing rows 0 - 1 (2 total, Query took 0.3459 sec)

SQL query:

SELECT * FROM `pageindex`
WHERE MATCH (catlist) AGAINST ('+Living_People +People_from_Maine +1956_births' IN BOOLEAN MODE)
LIMIT 0, 30
(which has two results, btw, Cynthia McFadden and David Kelley).
This consistently takes about a third of a second. Not bad, and that's for 2.6 million rows. I think we should restrict it to just current articles in the main namespace, but that's my opinion.
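[Editor's note: for readers who want to see the matching semantics without a MySQL fulltext index, here is a toy Python sketch. The page titles and category lists below are invented sample data (the source only confirms the two McFadden/Kelley hits); a real BOOLEAN MODE index does this via the index rather than a scan, so this only approximates the semantics of the query above.]

```python
# Toy illustration of the fulltext intersection query's semantics:
# a page matches when its space-separated catlist contains every '+' term.
# Sample data is invented for the example.
pages = {
    "Cynthia McFadden": "Living_People People_from_Maine 1956_births American_journalists",
    "David Kelley":     "Living_People People_from_Maine 1956_births American_writers",
    "Stephen King":     "Living_People People_from_Maine 1947_births",
}

def intersect(pages, *terms):
    """Return titles whose catlist contains all the given category terms."""
    return sorted(
        title for title, catlist in pages.items()
        if set(terms) <= set(catlist.split())
    )

print(intersect(pages, "Living_People", "People_from_Maine", "1956_births"))
# -> ['Cynthia McFadden', 'David Kelley']
```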
I figured out a better way to populate the table, too:
UPDATE pageindex, categorylinks
SET pageindex.catlist = CONCAT_WS(' ', pageindex.catlist, categorylinks.cl_to)
WHERE pageindex.pageid = categorylinks.cl_from
  AND INSTR(pageindex.catlist, categorylinks.cl_to) = 0
You have to run this query a number of times equal to the greatest number of distinct categories any one page has. Most pages have fewer than 7, but a few apparently have outrageous numbers. After doing a lot of searching around for a solution to concatenating multiple rows into one value, I think this is pretty good.
Aerik
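[Editor's note: the reason the UPDATE has to be repeated is that a MySQL multiple-table UPDATE modifies each matched target row at most once per statement, so each pass appends at most one new category per page. The sketch below illustrates that loop with invented data and Python dicts standing in for the two tables. One caveat the sketch sidesteps: the SQL uses INSTR(), a substring test, which could wrongly skip a category whose name is contained in another already-appended name; the sketch checks whole tokens instead.]

```python
# Sketch of the iterative population step: each pass appends at most one
# new category per page, so you loop until a pass changes nothing.
# The data below is invented sample data.
categorylinks = [
    (1, "Living_People"), (1, "People_from_Maine"), (1, "1956_births"),
    (2, "Living_People"), (2, "1947_births"),
]
catlist = {1: "", 2: ""}

passes = 0
changed = True
while changed:
    changed = False
    for page_id in catlist:
        for cl_from, cl_to in categorylinks:
            # whole-token version of the INSTR(catlist, cl_to) = 0 guard
            if cl_from == page_id and cl_to not in catlist[page_id].split():
                # mimic CONCAT_WS(' ', catlist, cl_to)
                catlist[page_id] = (catlist[page_id] + " " + cl_to).strip()
                changed = True
                break  # one append per page per pass, like the SQL UPDATE
    passes += 1
```

With a maximum of 3 categories on one page, this takes 3 productive passes plus one final pass that finds nothing left to append.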
-----Original Message-----
From: wikitech-l-bounces@wikimedia.org [mailto:wikitech-l-bounces@wikimedia.org] On Behalf Of Aerik Sylvan
Sent: 14 December 2006 08:06
To: wikitech-l@wikimedia.org
Subject: Re: [Wikitech-l] Category Intersections (again)
You have to run this query a number of times equal to the greatest number of distinct categories any one page has. Most pages have fewer than 7, but a few apparently have outrageous numbers. After doing a lot of searching around for a solution to concatenating multiple rows into one value, I think this is pretty good.
MySQL has GROUP_CONCAT() if that's what you're after.
Jared
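[Editor's note: a one-pass GROUP_CONCAT build can be sketched like this. SQLite's group_concat() with a separator argument behaves like MySQL's GROUP_CONCAT for this purpose, so the sketch runs against an in-memory SQLite database; table and column names follow the schema in the thread, and the rows are invented sample data. Note that neither function guarantees element order unless you ask for it, which doesn't matter for token matching.]

```python
import sqlite3

# One-pass catlist build with group_concat, sketched in SQLite.
# Sample rows are invented; real data would come from categorylinks.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
con.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [(1, "Living_People"), (1, "People_from_Maine"), (1, "1956_births"),
     (2, "Living_People"), (2, "1947_births")],
)
con.execute("CREATE TABLE pageindex (pageid INTEGER PRIMARY KEY, catlist TEXT)")
# Replaces the repeated UPDATE: one aggregate query builds every catlist.
con.execute(
    """INSERT INTO pageindex
       SELECT cl_from, group_concat(cl_to, ' ')
       FROM categorylinks
       GROUP BY cl_from"""
)
rows = dict(con.execute("SELECT pageid, catlist FROM pageindex"))
```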
Jared Williams <jared.williams1@...> writes:
MySQL has GROUP_CONCAT() if that's what you're after.
Jared
Whoo-hoo!! Thanks! It's amazing (and a little sad) that I didn't find this while googling around for a solution.
Anybody else got any feedback about this proposal? I'm taking the lack of criticism as a good sign. The implementation of this would obviously be very straightforward... figuring out the most efficient place to update the category intersections table is one question, and I figure trying to get some consensus on a UI for it might be a hurdle :-).
Best Regards, Aerik
Jared Williams wrote:
After doing a lot of searching around for a solution to concatenating multiple rows into one value, I think this is pretty good.
MySQL has GROUP_CONCAT() if that's what your after.
That's unfortunately not available in MySQL 4.0, so keep that in mind.
We're still running 4.0 at Wikimedia.
-- brion vibber (brion @ pobox.com)