On Jul 18, 2008, at 16:03, Marc Cousin <mcousin(a)sigma.fr> wrote:
EXPLAIN ANALYZE
SELECT page_id, page_namespace, page_title, old_text AS page_text,
       ts_rank(titlevector, to_tsquery('default', 'postgres')) AS rnk
FROM page p, revision r, pagecontent c
WHERE p.page_latest = r.rev_id
  AND r.rev_text_id = c.old_id
  AND textvector @@ to_tsquery('default', 'postgres')
  AND page_is_redirect = '0'
  AND page_namespace IN (0, 9, 11)
ORDER BY rnk DESC, page_id DESC
LIMIT 20 OFFSET 0;
Hmm; a fundamental problem here is that text.old_text is opaque to the
database -- it may contain compressed text, text in an alternate
encoding, or a reference to another table or database at the
application level.
This is part of why we use a separate table for the search index with
the MySQL search -- it's consistently readable by the database's
search engine, and it doubles as a summary table, so the index only
has to cover live versions.
-- brion
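
(For illustration: a hypothetical PostgreSQL analogue of the separate
summary table Brion describes might look like the sketch below. The
names are made up, not MediaWiki's actual searchindex schema; the
point is that the full-text index then only ever covers live
revisions.)

CREATE TABLE search_summary (
    page_id    integer PRIMARY KEY REFERENCES page (page_id),
    textvector tsvector NOT NULL
);

CREATE INDEX search_summary_textvector_idx
    ON search_summary USING gin (textvector);

-- On page save, the application (or a trigger) replaces that page's
-- row, so the table always holds exactly one searchable row per live
-- page.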
The plan is this one:
Limit  (cost=718.49..718.50 rows=1 width=621) (actual time=305.943..305.982 rows=20 loops=1)
  ->  Sort  (cost=718.49..718.50 rows=1 width=621) (actual time=305.939..305.952 rows=20 loops=1)
        Sort Key: rank(p.titlevector, '''postgr'''::tsquery), p.page_id
        ->  Nested Loop  (cost=0.00..718.48 rows=1 width=621) (actual time=4.278..305.671 rows=44 loops=1)
              ->  Nested Loop  (cost=0.00..695.00 rows=21 width=204) (actual time=0.829..76.740 rows=3210 loops=1)
                    ->  Seq Scan on page p  (cost=0.00..524.95 rows=21 width=204) (actual time=0.804..19.686 rows=3210 loops=1)
                          Filter: (((page_is_redirect)::text = '0'::text) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
                    ->  Index Scan using revision_rev_id_key on revision r  (cost=0.00..8.09 rows=1 width=8) (actual time=0.012..0.013 rows=1 loops=3210)
                          Index Cond: (p.page_latest = r.rev_id)
              ->  Index Scan using pagecontent_pkey on pagecontent c  (cost=0.00..1.11 rows=1 width=425) (actual time=0.069..0.069 rows=0 loops=3210)
                    Index Cond: (r.rev_text_id = c.old_id)
                    Filter: (textvector @@ '''postgr'''::tsquery)
Total runtime: 306.118 ms
This plan joins page and revision to find the latest revision of each
page, then scans every matching pagecontent row to determine which
ones match my query.
There is also another plan, depending on the amount of RAM available
and on the estimated number of 'latest' pagecontent rows (see the
work_mem side note after this plan):
Limit  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.594..224.646 rows=20 loops=1)
  ->  Sort  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.591..224.610 rows=20 loops=1)
        Sort Key: (ts_rank(p.titlevector, '''postgr'''::tsquery)), p.page_id
        Sort Method:  top-N heapsort  Memory: 37kB
        ->  Hash Join  (cost=2689.31..2979.45 rows=4 width=504) (actual time=211.141..224.432 rows=43 loops=1)
              Hash Cond: (p.page_latest = r.rev_id)
              ->  Seq Scan on page p  (cost=0.00..276.86 rows=3527 width=82) (actual time=0.460..10.202 rows=3118 loops=1)
                    Filter: ((page_is_redirect = '0'::bpchar) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
              ->  Hash  (cost=2688.26..2688.26 rows=84 width=430) (actual time=210.409..210.409 rows=1517 loops=1)
                    ->  Hash Join  (cost=534.76..2688.26 rows=84 width=430) (actual time=26.557..207.725 rows=1517 loops=1)
                          Hash Cond: (r.rev_text_id = c.old_id)
                          ->  Seq Scan on revision r  (cost=0.00..1836.94 rows=84194 width=8) (actual time=0.023..98.850 rows=84194 loops=1)
                          ->  Hash  (cost=533.59..533.59 rows=93 width=430) (actual time=18.182..18.182 rows=1515 loops=1)
                                ->  Bitmap Heap Scan on pagecontent c  (cost=190.83..533.59 rows=93 width=430) (actual time=0.585..15.663 rows=1515 loops=1)
                                      Recheck Cond: (textvector @@ '''postgr'''::tsquery)
                                      ->  Bitmap Index Scan on ts2_page_text2  (cost=0.00..190.81 rows=93 width=0) (actual time=0.431..0.431 rows=1515 loops=1)
                                            Index Cond: (textvector @@ '''postgr'''::tsquery)
Total runtime: 224.765 ms
The times differ because this machine is much more powerful. This
time, PostgreSQL decides to fetch all matching articles from
pagecontent, across all versions, and only then determines which of
them are the latest.
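
(Side note, illustrative only: which of these two plans gets chosen
can also be nudged per session, since the hash join becomes cheaper
when more sort/hash memory is available. The value below is made up,
not a recommendation:)

SET work_mem = '64MB';            -- illustrative value
EXPLAIN ANALYZE SELECT ... ;      -- same query as above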
In both cases this is rather inefficient, since, as I understand it,
we only ever search the latest version of each article.
So I'm coming to the point: is there a reason we index every version
of every page's content?
For instance, on my database I've tested removing all textvectors
from pagecontent except those for the latest version of each page. My
text index size went from 400 MB to 15 MB, and my text search times
dropped to a near-constant 10 ms for all queries. The textvectors can
then be maintained by slightly modifying the trigger on pagecontent
so that it cleans up the previous record while updating the table.
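
(A rough sketch of one way to do that cleanup -- here as a trigger on
page rather than pagecontent, with made-up names; it assumes the
table and column relationships from the query above and is not
MediaWiki's actual tsearch2 trigger:)

CREATE OR REPLACE FUNCTION page_textvector_cleanup() RETURNS trigger AS $$
BEGIN
    -- When a page points to a new latest revision, blank out the
    -- tsvector of the revision it superseded, so the full-text index
    -- only covers live text.
    IF OLD.page_latest IS DISTINCT FROM NEW.page_latest THEN
        UPDATE pagecontent SET textvector = NULL
         WHERE old_id = (SELECT rev_text_id FROM revision
                          WHERE rev_id = OLD.page_latest);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER page_textvector_cleanup
    AFTER UPDATE ON page
    FOR EACH ROW EXECUTE PROCEDURE page_textvector_cleanup();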
If I'm posting in the wrong place, please tell me. If the idea is
stupid, please tell me also :)
Cheers
Marc Cousin