Hi,
I'm new to this list, and I'm not sure I'm posting in the right place.
I've got a question about full-text indexing.
I have a PostgreSQL MediaWiki instance installed on a very small server, and have been
working on tweaking its performance these last few days. Once everything that could be
done from the OS and database parameters was done, I decided to take a look at the SQL
queries. The main problem I'm seeing is with the text searches. I don't really know how
it's handled with MySQL, but with PostgreSQL, I think there is some optimization that
could be done, though I'm not sure about it as I don't know all the code.
The text search query is this one:

EXPLAIN ANALYZE
SELECT page_id, page_namespace, page_title, old_text AS page_text,
       ts_rank(titlevector, to_tsquery('default','postgres')) AS rnk
FROM page p, revision r, pagecontent c
WHERE p.page_latest = r.rev_id
  AND r.rev_text_id = c.old_id
  AND textvector @@ to_tsquery('default','postgres')
  AND page_is_redirect = '0'
  AND page_namespace IN (0,9,11)
ORDER BY rnk DESC, page_id DESC
LIMIT 20 OFFSET 0;
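As an aside, the to_tsquery call stems the search term, which is why the plans show
'postgr' rather than 'postgres':

```sql
-- The 'default' configuration stems the term before matching,
-- so 'postgres' becomes the lexeme 'postgr' seen in the plans.
SELECT to_tsquery('default', 'postgres');
--  'postgr'
```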
The plan is this one:

Limit  (cost=718.49..718.50 rows=1 width=621) (actual time=305.943..305.982 rows=20 loops=1)
  ->  Sort  (cost=718.49..718.50 rows=1 width=621) (actual time=305.939..305.952 rows=20 loops=1)
        Sort Key: rank(p.titlevector, '''postgr'''::tsquery), p.page_id
        ->  Nested Loop  (cost=0.00..718.48 rows=1 width=621) (actual time=4.278..305.671 rows=44 loops=1)
              ->  Nested Loop  (cost=0.00..695.00 rows=21 width=204) (actual time=0.829..76.740 rows=3210 loops=1)
                    ->  Seq Scan on page p  (cost=0.00..524.95 rows=21 width=204) (actual time=0.804..19.686 rows=3210 loops=1)
                          Filter: (((page_is_redirect)::text = '0'::text) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
                    ->  Index Scan using revision_rev_id_key on revision r  (cost=0.00..8.09 rows=1 width=8) (actual time=0.012..0.013 rows=1 loops=3210)
                          Index Cond: (p.page_latest = r.rev_id)
              ->  Index Scan using pagecontent_pkey on pagecontent c  (cost=0.00..1.11 rows=1 width=425) (actual time=0.069..0.069 rows=0 loops=3210)
                    Index Cond: (r.rev_text_id = c.old_id)
                    Filter: (textvector @@ '''postgr'''::tsquery)
Total runtime: 306.118 ms
This plan joins page and revision to determine all the latest revisions of pagecontent,
then scans all of those pagecontent rows to determine which ones match my query.
There is also another plan, depending on the amount of RAM available and the estimated
number of 'latest pagecontents':
Limit  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.594..224.646 rows=20 loops=1)
  ->  Sort  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.591..224.610 rows=20 loops=1)
        Sort Key: (ts_rank(p.titlevector, '''postgr'''::tsquery)), p.page_id
        Sort Method: top-N heapsort  Memory: 37kB
        ->  Hash Join  (cost=2689.31..2979.45 rows=4 width=504) (actual time=211.141..224.432 rows=43 loops=1)
              Hash Cond: (p.page_latest = r.rev_id)
              ->  Seq Scan on page p  (cost=0.00..276.86 rows=3527 width=82) (actual time=0.460..10.202 rows=3118 loops=1)
                    Filter: ((page_is_redirect = '0'::bpchar) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
              ->  Hash  (cost=2688.26..2688.26 rows=84 width=430) (actual time=210.409..210.409 rows=1517 loops=1)
                    ->  Hash Join  (cost=534.76..2688.26 rows=84 width=430) (actual time=26.557..207.725 rows=1517 loops=1)
                          Hash Cond: (r.rev_text_id = c.old_id)
                          ->  Seq Scan on revision r  (cost=0.00..1836.94 rows=84194 width=8) (actual time=0.023..98.850 rows=84194 loops=1)
                          ->  Hash  (cost=533.59..533.59 rows=93 width=430) (actual time=18.182..18.182 rows=1515 loops=1)
                                ->  Bitmap Heap Scan on pagecontent c  (cost=190.83..533.59 rows=93 width=430) (actual time=0.585..15.663 rows=1515 loops=1)
                                      Recheck Cond: (textvector @@ '''postgr'''::tsquery)
                                      ->  Bitmap Index Scan on ts2_page_text2  (cost=0.00..190.81 rows=93 width=0) (actual time=0.431..0.431 rows=1515 loops=1)
                                            Index Cond: (textvector @@ '''postgr'''::tsquery)
Total runtime: 224.765 ms
Times are different because this machine is much more powerful.
This time, PostgreSQL decides to get all matching articles from pagecontent, whatever
their version, and then determines which ones are the latest.
In both cases this is rather inefficient, as I guess we only ever search the latest
version of the articles.
So I'm coming to the point...
Is there a reason we index every version of every page's content?
For instance, with my database, I've tested removing all textvectors from pagecontent
except for the latest version of each page. My text index size went from 400MB to 15MB,
and my text search times went down to a near-constant 10ms for all queries. I can then
maintain the textvectors by modifying the trigger on pagecontent a bit to clean up the
previous record while updating the table.
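To make the idea concrete, here's a rough sketch of the one-time cleanup I ran, written
against the standard MediaWiki PostgreSQL schema; the trigger-side change isn't shown in
full, so treat this as an illustration of the approach rather than tested code:

```sql
-- Keep a textvector only on each page's latest revision;
-- NULL it out everywhere else (this is what shrank my index).
UPDATE pagecontent
   SET textvector = NULL
 WHERE old_id NOT IN (SELECT r.rev_text_id
                        FROM page p
                        JOIN revision r ON p.page_latest = r.rev_id);

-- The trigger on pagecontent would then, besides computing the
-- textvector for the newly inserted revision, run a similar UPDATE
-- to clear the vector on the revision it supersedes.
```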
If I'm posting in the wrong place, please tell me. If the idea is stupid, please tell
me too :)
Cheers
Marc Cousin