Sorry, it took me longer to get around to this than I'd planned. I restored the 6-million-row categorylinks table to a local computer for testing and threw some SQL at it. I got mixed results: in the first pass I used my "count ... group by" approach and ran different queries to get the pages at the intersection of 2 categories. I used at least semi-meaningful categories to try to make the testing somewhat representative of real possible usage.
I got several sets of results in under 1 second (the lowest time being .3 seconds); one query returned in 8 seconds and another in 36 seconds. I'm going to try re-running the same queries several times after emptying the query cache (gotta go learn how to do that) to see how repeatable they are, then see if I can glean what the long query times correlate to (intuitively I'm guessing they come from intersections of large categories, but I haven't tested that yet even though it's easy to do). I'll publish detailed results with figures and actual queries once I've got more data. (Plan to do this tonight or tomorrow night.)
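(Aerik's actual queries aren't shown here; a minimal sketch of a "count ... group by" intersection against the stock categorylinks table, assuming that's the form he used, would be:)

  select cl_from, count(*) as matches
  from categorylinks
  where cl_to in ('Living_people', 'ABBA_members')  -- example category pair; any two cl_to values work
  group by cl_from
  having count(*) = 2;                              -- keep only pages carrying both categories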
Best Regards, Aerik
On 9/25/06, Aerik Sylvan aerik@thesylvans.com wrote: [snip]
I got several sets of results in under 1 second (the lowest time being .3 seconds), one query returned in 8 seconds and another in 36 seconds.
[snip]
.. 8 seconds... 36 seconds... ?
Why do we keep shooting ourselves in the head with a toy database?
Here is what's possible with PostgreSQL (this is all hot cache, but it would all be hot cache in production too... the page table just isn't that big)...
First some setup:
wikidb=> alter table page add column cats text[];
ALTER TABLE
wikidb=> update page set cats = array(select cl_to from categorylinks where cl_from=page_id);
UPDATE
wikidb=> create index page_cat_idx on page using gin(cats);
CREATE INDEX
Okay, let's see what this baby can do:
wikidb=> explain analyze select page_title,page_namespace,cats from page where cats @> '{Living_people,ABBA_members}';
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on page  (cost=104.37..17426.81 rows=4760 width=119) (actual time=11.900..11.912 rows=4 loops=1)
   Recheck Cond: (cats @> '{Living_people,ABBA_members}'::pg_catalog.text[])
   ->  Bitmap Index Scan on page_cat_idx  (cost=0.00..104.37 rows=4760 width=0) (actual time=11.886..11.886 rows=4 loops=1)
         Index Cond: (cats @> '{Living_people,ABBA_members}'::pg_catalog.text[])
 Total runtime: 11.973 ms
(5 rows)
11.973ms, cool, and the results:
wikidb=> select page_title,page_namespace,cats from page where cats @> '{Living_people,ABBA_members}';
     page_title     | page_namespace |                                                                   cats
--------------------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------
 Benny_Andersson    |              0 | {1946_births,ABBA_members,Living_people,Musical_theatre_composers,Pop_pianists,Swedish_pianists,Swedish_songwriters}
 Björn_Ulvaeus      |              0 | {1945_births,ABBA_members,Humanists,Living_people,Musical_theatre_composers,People_from_Gothenburg,Swedish_songwriters}
 Anni-Frid_Lyngstad |              0 | {1945_births,ABBA_members,Anni-Frid_Lyngstad,German_nobility,Living_people,Norwegian_musicians,Princesses,Swedish_female_singers,Swedish_pop_singers}
 Agnetha_Fältskog   |              0 | {1950_births,ABBA_members,Agnetha_Fältskog,Living_people,People_from_Jönköping,Swedish_female_singers,Swedish_pop_singers}
(4 rows)
Let's be mean:
wikidb=> select count(1) from page where cats @> '{Living_people}';
 count
--------
 112153
(1 row)
Total runtime: 1156.697 ms
And mean some more:
wikidb=> select count(1) from page where cats @> '{GFDL_images}';
 count
--------
 100803
(1 row)
Total runtime: 987.029 ms
If we tried the following in MySQL, it might take hours:
wikidb=> select count(1) from page where cats @> '{Living_people,GFDL_images}';
 count
-------
     0
(1 row)
Total runtime: 25.201 ms
Hmm... 25ms < hours.
Enough people...
wikidb=> select count(1) from page where cats @> '{Unprintworthy_redirects}';
 count
-------
 14602
(1 row)
Total runtime: 69.335 ms

wikidb=> select count(1) from page where cats @> '{Unprintworthy_redirects,Redirects_from_plurals}';
 count
-------
  2415
(1 row)
Total runtime: 12.002 ms
Hmm, what redirects from plurals are printworthy? :)
wikidb=> select page_namespace,page_title from page where cats @> '{Redirects_from_plurals}' and not cats @> '{Unprintworthy_redirects}';
 page_namespace |                  page_title
----------------+----------------------------------------------
              0 | Guide_dogs
              4 | WikiProject_Red_Link_Recovery/Pluralisation
              0 | Assistance_dogs
              0 | Hearing_dogs
(4 rows)
Total runtime: 3.867 ms
Okay, how about some useful trivia:
wikidb=> select page_title from page where cats @> '{1982_deaths,American_novelists}';
       page_title
-------------------------
 Ayn_Rand
 Philip_K._Dick
 Djuna_Barnes
 John_Cheever
 John_Gardner
 William_R._Burnett
 William_P._McGivern
 Theresa_Hak_Kyung_Cha
 Margaret_Culkin_Banning
 Beirne_Lay,_Jr.
(10 rows)
Total runtime: 0.417 ms
Wow, realistic queries go quickly... I bet PHP takes longer to *start* than that took to complete...
wikidb=> select page_title from page where cats @> '{American_philosophers,American_physicists}';
 page_title
------------
(0 rows)
Total runtime: 0.205 ms
So... sub-millisecond on normal queries... worst case around a second... cases which are evil for an inner join are no issue...
Hmmmmm.......
How else do little choices like our database platform hold us back?
Gregory Maxwell wrote:
Why do we keep shooting ourselves in the head with a toy database?
Because this "toy database" is holding up a server that serves millions of page requests per second, quite unlike your own computer which can concentrate all its power on the individual queries you gave it.
Timwi
On 9/27/06, Timwi timwi@gmx.net wrote:
Because this "toy database" is holding up a server that serves millions of page requests per second, quite unlike your own computer which can concentrate all its power on the individual queries you gave it.
* We're not serving millions per second... not to the front ends, and certainly not to the databases.
* What else is running on my computer is irrelevant: I provided timings, and you can compare the results against MySQL yourself. I tried the GFDL_images intersect Living_people query on MySQL as an inner join, and gave up and aborted it after 5 minutes... the same query takes 3.5 seconds on PostgreSQL. I'm assuming that's because PG can use bitmaps to combine the index outputs from the two halves of the intersection, which is a feature on the roadmap for MySQL at least... but 3.5 seconds is the PG number without using array indexing (which, as I posted, is much faster).
* If you'd like, I can set up a parallel load test: the results should be more impressive than my single-query numbers suggested, since single queries only use a single CPU.
* Because of a number of factors (including support for special indexes), MySQL falls short when compared to other database systems for everything other than the vanilla insert, update, and select queries, which it does quite well on...
* The end result is that more powerful functionality such as category intersections and full-text searching is frustrated by the requirement that we employ additional software outside of the DB, then build and maintain glue.
On 9/28/06, Timwi timwi@gmx.net wrote:
Because this "toy database" is holding up a server that serves millions of page requests per second, quite unlike your own computer which can concentrate all its power on the individual queries you gave it.
Both Aerik and Greg were running on their own machines with local copies of the db (or parts thereof). To really get a comparison you'd have to run tests on the same machine, running the same queries. Squabbling over favourite software can wait until after they are compared side by side.
I think the question Aerik was posing was what factors make some cat intersections take longer than others, not what makes it take a certain number value.
I'm new to Wikimedia tech and have one question to ask; excuse me if I'm asking something that's wrong or already well known.
I'm in no position to judge on behalf of any solution, but there must be a way to find the truth.
Are there machines reserved for doing tests on the Wikimedia project? If there is a test lab/equipment, is there some kind of test procedure? Are there tests for making benchmarks?
Thanks
Milos
On 9/27/06, Stephen Bain stephen.bain@gmail.com wrote:
On 9/28/06, Timwi timwi@gmx.net wrote:
Because this "toy database" is holding up a server that serves millions of page requests per second, quite unlike your own computer which can concentrate all its power on the individual queries you gave it.
Both Aerik and Greg were running on their own machines with local copies of the db (or parts thereof). To really get a comparison you'd have to run tests on the same machine, running the same queries. Squabbling over favourite software can wait until after they are compared side by side.
I think the question Aerik was posing was what factors make some cat intersections take longer than others, not what makes it take a certain number value.
-- Stephen Bain stephen.bain@gmail.com
On 9/27/06, Stephen Bain stephen.bain@gmail.com wrote:
Both Aerik and Greg were running on their own machines with local copies of the db (or parts thereof). To really get a comparison you'd have to run tests on the same machine, running the same queries. Squabbling over favourite software can wait until after they are compared side by side.
MySQL cannot run all of the same queries because it is less powerful software, which is my point... it's not a matter of favorites.
I did test MySQL on the pathological case of intersecting two huge categories. I gave up after it chewed on it for five minutes.
The question is one of computational order.
On MySQL the only obvious and reasonable way to accomplish intersections is via an inner join.
MySQL will execute this with an index lookup on the first cat match followed by a loop testing each of the returned cl_froms against the second cat. This is very slow if the first selected cat is large.
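For illustration, the inner join in question would be something along these lines (the exact query isn't given in the thread, so this self-join on categorylinks is an assumption, using the cl_from/cl_to columns from the schema above):

  select c1.cl_from
  from categorylinks c1
  join categorylinks c2 on c2.cl_from = c1.cl_from  -- same page must appear in both categories
  where c1.cl_to = 'Living_people'
    and c2.cl_to = 'GFDL_images';

This is just the loop described above written out: scan one category's index entries, then probe the other category once per row.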
Given the same inner join query, PostgreSQL will separately build in-memory bitmaps for the results of the index lookups, apply the intersection to the bitmaps, then read the results out sequentially. This is much faster in the worst case, and similar in the easy cases.
But PostgreSQL goes further, because it also supports both inverted indexes and Russian Doll tree indexes for multivalue fields. Both of these methods turn our intersections into very cheap operations with something like O(members log members) behavior and a very low constant. ... The case of intersecting two huge cats takes just milliseconds... the only thing kinda slow you can make it do is produce a hundred thousand results, which causes MVCC visibility checks to dominate the query time.
Both the inverted index and the RD-tree would have similar performance with only a few million rows. I used an inverted index in my example because it is what I use for analysis work... I often build indexes on links and images against *all revisions* of all pages. The inverted index scales better for that, keeping most operations sub-second.
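As a rough sketch of the RD-tree alternative (this assumes the intarray contrib module and a hypothetical cat_ids int[] column of numeric category ids; neither appears in the setup above, which used text[] with a GIN index):

  alter table page add column cat_ids int[];
  -- cat_ids would be filled from numeric category ids (not shown; the thread's schema keys categories by name)
  create index page_catid_rd_idx on page using gist (cat_ids gist__int_ops);
  select count(1) from page where cat_ids @> '{1234,5678}';  -- 1234/5678 are placeholder category ids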
I think the question Aerik was posing was what factors make some cat intersections take longer than others, not what makes it take a certain number value.
Didn't Domas post about this last time it came up? As I said above, with *MySQL* you can't avoid the loop over the results of half the lookup... that's just going to be slow sometimes. Hopefully the database will be smart and loop over the smaller of the two, but I wouldn't count on it... and the case of two large cats is just going to stink.
So here is the straight dope:
1) Using joins for this will never be blindingly fast in the worst case.
2) MySQL joins are slower than they could be because MySQL doesn't support in-memory bitmaps yet (although that's coming).
3) Other databases offer alternative indexing schemes which are fast in all cases.
So long as we are staying on MySQL (which is good for many reasons, just not at all good for things which are more complex than simple web traffic), the only way for us to get acceptable results for this is going to be to call an outside piece of software, exactly as we've done with Lucene for full-text search.
It might be wise of us to instead consider figuring out how to integrate a link to PostgreSQL, which not only can do inverted index based full-text search like Lucene, but can support our need for things like intersections, and future applications like fast geospatial queries. (Yes, MySQL can do some of these things a bit better if you give up reliability and use the MyISAM table type rather than InnoDB...)
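As a very rough sketch of the full-text side (this assumes the tsearch2 contrib module of that era; the textvector column and index name are hypothetical, and a real setup would index article text rather than titles):

  alter table page add column textvector tsvector;
  update page set textvector = to_tsvector(page_title);
  create index page_fts_idx on page using gist (textvector);
  select page_title from page where textvector @@ to_tsquery('category & intersection');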
Gregory Maxwell wrote:
So here is the straight dope:
1) Using joins for this will never be blindingly fast in the worst case.
2) MySQL joins are slower than they could be because MySQL doesn't support in-memory bitmaps yet (although that's coming).
3) Other databases offer alternative indexing schemes which are fast in all cases.
So long as we are staying on MySQL (which is good for many reasons, just not at all good for things which are more complex than simple web traffic), the only way for us to get acceptable results for this is going to be to call an outside piece of software, exactly as we've done with Lucene for full-text search.
It might be wise of us to instead consider figuring out how to integrate a link to PostgreSQL, which not only can do inverted index based full-text search like Lucene, but can support our need for things like intersections, and future applications like fast geospatial queries. (Yes, MySQL can do some of these things a bit better if you give up reliability and use the MyISAM table type rather than InnoDB...)
Now, doing _that_ seems like a good idea, since it
a) will solve just one problem, and, it seems, solve it rather well
b) will not affect critical functions if it fails: only category intersection queries should be affected by a failure
c) leaves the MySQL database as still the master data repository
d) will give the developers some experience in using PostgreSQL, which might come in useful in the future...
-- Neil