Hi,
I'm new to this list, and I'm not sure I'm posting in the right place. I have a question about text indexing.
I have a postgresql mediawiki instance installed on a very small server, and I have been working on tweaking its performance these last few days. Once everything that could be done with the OS and database parameters was done, I decided to take a look at the SQL queries.
The main problem I'm seeing is with the text searches. I don't really know how it's handled with mysql, but with postgresql I think there is some optimization that could be done, though I'm not sure about it as I don't know all the code.
The text search query is this one:
EXPLAIN ANALYZE
SELECT page_id, page_namespace, page_title, old_text AS page_text,
       ts_rank(titlevector, to_tsquery('default','postgres')) AS rnk
  FROM page p, revision r, pagecontent c
 WHERE p.page_latest = r.rev_id
   AND r.rev_text_id = c.old_id
   AND textvector @@ to_tsquery('default','postgres')
   AND page_is_redirect = '0'
   AND page_namespace IN (0,9,11)
 ORDER BY rnk DESC, page_id DESC
 LIMIT 20 OFFSET 0;
The plan is this one:
Limit  (cost=718.49..718.50 rows=1 width=621) (actual time=305.943..305.982 rows=20 loops=1)
  ->  Sort  (cost=718.49..718.50 rows=1 width=621) (actual time=305.939..305.952 rows=20 loops=1)
        Sort Key: rank(p.titlevector, '''postgr'''::tsquery), p.page_id
        ->  Nested Loop  (cost=0.00..718.48 rows=1 width=621) (actual time=4.278..305.671 rows=44 loops=1)
              ->  Nested Loop  (cost=0.00..695.00 rows=21 width=204) (actual time=0.829..76.740 rows=3210 loops=1)
                    ->  Seq Scan on page p  (cost=0.00..524.95 rows=21 width=204) (actual time=0.804..19.686 rows=3210 loops=1)
                          Filter: (((page_is_redirect)::text = '0'::text) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
                    ->  Index Scan using revision_rev_id_key on revision r  (cost=0.00..8.09 rows=1 width=8) (actual time=0.012..0.013 rows=1 loops=3210)
                          Index Cond: (p.page_latest = r.rev_id)
              ->  Index Scan using pagecontent_pkey on pagecontent c  (cost=0.00..1.11 rows=1 width=425) (actual time=0.069..0.069 rows=0 loops=3210)
                    Index Cond: (r.rev_text_id = c.old_id)
                    Filter: (textvector @@ '''postgr'''::tsquery)
Total runtime: 306.118 ms
This plan joins page and revision to find the latest revision of each page, then checks each corresponding pagecontent row to determine which ones match my query.
There is also another plan, depending on the amount of RAM available and on the estimated number of 'latest' pagecontent rows:
Limit  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.594..224.646 rows=20 loops=1)
  ->  Sort  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.591..224.610 rows=20 loops=1)
        Sort Key: (ts_rank(p.titlevector, '''postgr'''::tsquery)), p.page_id
        Sort Method: top-N heapsort  Memory: 37kB
        ->  Hash Join  (cost=2689.31..2979.45 rows=4 width=504) (actual time=211.141..224.432 rows=43 loops=1)
              Hash Cond: (p.page_latest = r.rev_id)
              ->  Seq Scan on page p  (cost=0.00..276.86 rows=3527 width=82) (actual time=0.460..10.202 rows=3118 loops=1)
                    Filter: ((page_is_redirect = '0'::bpchar) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
              ->  Hash  (cost=2688.26..2688.26 rows=84 width=430) (actual time=210.409..210.409 rows=1517 loops=1)
                    ->  Hash Join  (cost=534.76..2688.26 rows=84 width=430) (actual time=26.557..207.725 rows=1517 loops=1)
                          Hash Cond: (r.rev_text_id = c.old_id)
                          ->  Seq Scan on revision r  (cost=0.00..1836.94 rows=84194 width=8) (actual time=0.023..98.850 rows=84194 loops=1)
                          ->  Hash  (cost=533.59..533.59 rows=93 width=430) (actual time=18.182..18.182 rows=1515 loops=1)
                                ->  Bitmap Heap Scan on pagecontent c  (cost=190.83..533.59 rows=93 width=430) (actual time=0.585..15.663 rows=1515 loops=1)
                                      Recheck Cond: (textvector @@ '''postgr'''::tsquery)
                                      ->  Bitmap Index Scan on ts2_page_text2  (cost=0.00..190.81 rows=93 width=0) (actual time=0.431..0.431 rows=1515 loops=1)
                                            Index Cond: (textvector @@ '''postgr'''::tsquery)
Total runtime: 224.765 ms
Times are different because this machine is much more powerful.
This time, postgresql decides to fetch all the matching rows from pagecontent, across all versions, and then determines which ones are the latest.
In both cases this is rather inefficient, since I assume we only want to search the latest version of each article.
So I'm coming to the point ...
Is there a reason why we index every version of every page's content?
For instance, with my database, I've tested removing all textvectors from pagecontent except for the latest version of each page. My text index size went from 400 MB to 15 MB, and my text search times went down to a near-constant 10 ms for all queries. I can then maintain the textvectors by modifying the trigger on pagecontent a bit, so that it cleans up the previous record while updating the table.
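To give an idea, the one-off cleanup I tested boils down to something like this (just a sketch, using the column names that appear in the plans above):

UPDATE pagecontent
   SET textvector = NULL
 WHERE old_id NOT IN (SELECT r.rev_text_id
                        FROM page p
                        JOIN revision r ON r.rev_id = p.page_latest);

probably followed by a REINDEX of the gin index on textvector (ts2_page_text2 here) so that the index really shrinks on disk.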
If I'm posting in the wrong place, please tell me. If the idea is stupid, please tell me also :)
Cheers
Marc Cousin
On Jul 18, 2008, at 16:03, Marc Cousin <mcousin@sigma.fr> wrote:
EXPLAIN ANALYZE
SELECT page_id, page_namespace, page_title, old_text AS page_text,
       ts_rank(titlevector, to_tsquery('default','postgres')) AS rnk
  FROM page p, revision r, pagecontent c
 WHERE p.page_latest = r.rev_id
   AND r.rev_text_id = c.old_id
   AND textvector @@ to_tsquery('default','postgres')
   AND page_is_redirect = '0'
   AND page_namespace IN (0,9,11)
 ORDER BY rnk DESC, page_id DESC
 LIMIT 20 OFFSET 0;
Hmm; a fundamental problem here is that text.old_text is opaque to the database -- it may contain compressed text, text in an alternate encoding, or a reference to another table or database at the application level.
This is part of why we use a separate table for the search index for the MySQL search -- it's actually consistently readable to the db's search engine as well as providing a summary table so the index only has to cover live versions.
-- brion
On Saturday 19 July 2008 15:44:51 Brion Vibber wrote:
Hmm; a fundamental problem here is that text.old_text is opaque to the database -- it may contain compressed text, text in an alternate encoding, or a reference to another table or database at the application level.
Hi,
Thanks for the answer...
What do you mean by 'opaque'?
By the way, the separate table for mysql solves the problem I'm having with postgresql (except that postgresql doesn't have the myisam problem, so the fulltext index could be stored directly in the 'page' table). For a wiki like mine, with a lot of versions per document, text search performance becomes very bad.
So can you explain to me what you mean by opaque?
Thanks a lot for your time.
Marc
By 'opaque' he means that the databases are supposed to handle that field transparently. They are supposed to send the data raw to the app and act as if they have absolutely no clue how to handle the data inside of that field.
The reason is that the data inside that column is inconsistent, and letting the database modify it, or pretend it knows what is inside, can end up with really unexpected results.
Ok, I now understand the meaning of opaque, even if that data isn't quite so opaque, since the database and mediawiki are already using it for full-text indexing.
I totally agree with you that it would be a really bad idea to try to mess with the content of the page (that wasn't what I was talking about).
Currently, for postgresql, there is a trigger updating the 'textvector' column in the pagecontent table when 'old_text' is modified, so that the text search index stays up to date.
The problem is that the textvector should only exist for the latest version of each page's content. There is no point in indexing the old versions of a page's contents.
That's done for mysql because of the 1:1 relationship between page and searchindex, but isn't done for postgresql.
For PostgreSQL, there are at least two solutions:
- Put the textvector into the page table
- Do the same as for mysql: create a searchindex table (even if the myisam justification for this table doesn't hold for postgresql)
Plus there is the temporary fix I was talking about in my mail: put an empty textvector in all versions of a document except the latest one, by changing the trigger a bit. That divided the size of my fulltext index by 30 and made the text search fast again.
What would be the preferred way to solve this problem?
On Monday 21 July 2008 19:47:09 Daniel Friesen wrote:
By 'opaque' he means that the databases are supposed to handle that field transparently. They are supposed to send the data raw to the app and act as if they have absolutely no clue how to handle the data inside of that field.
The reason is that the data inside that column is inconsistent, and letting the database modify it, or pretend it knows what is inside, can end up with really unexpected results.
Marc Cousin wrote:
For PostgreSQL, there are at least two solutions:
- Put the textvector into the page table
- Do the same as for mysql: create a searchindex table (even if the myisam justification for this table doesn't hold for postgresql)
Plus there is the temporary fix I was talking about in my mail: put an empty textvector in all versions of a document except the latest one, by changing the trigger a bit. That divided the size of my fulltext index by 30 and made the text search fast again.
What would be the preferred way to solve this problem?
Probably to create a new table. The less you differ from the mysql setup, the better.
I totally agree with you.
Is there a way of getting this into a future mediawiki release (I can do the work if necessary)?
On Friday 25 July 2008 17:05:48 Platonides wrote:
Probably to create a new table. The less you differ from the mysql setup, the better.
Marc Cousin wrote:
I totally agree with you.
Is there a way of getting this into a future mediawiki release (I can do the work if necessary)?
Sure. Ask brion for an SVN account (you'll need to send him a public key). The postgresql backend is not too well maintained; people wanting postgres support appear from time to time and give it a kick. I don't remember who is currently supposed to be taking care of it, but I interpret the fact that he hasn't commented on this as meaning he doesn't oppose ;)
On Monday 28 July 2008 13:34:41 Platonides wrote:
Sure. Ask brion for an SVN account (you'll need to send him a public key). The postgresql backend is not too well maintained; people wanting postgres support appear from time to time and give it a kick. I don't remember who is currently supposed to be taking care of it, but I interpret the fact that he hasn't commented on this as meaning he doesn't oppose ;)
Okay.
First, I'd like to sum up what needs to be done (please, anybody, tell me if I'm wrong; I'll let a few days pass before I start working).
= Current state =
MySQL's way of doing text indexing:
- Article updates are done on text (pagecontent for postgresql) and on page (I of course omit the rest of the schema, as it's the same as far as full-text indexing is concerned).
- Sometimes, an indexing job locks searchindex and puts the new page contents into it (retrieving the contents from the text table). The new page contents are found thanks to recentchanges.
For PostgreSQL, right now, the full-text data is stored in the pagecontent (text) table, so there are many unnecessary versions of the full-text data.
What I'd like to know is the reason for this asynchronous indexing in mysql : is it because of the myisam table for full text (to avoid locking during inserts), or for other performance reasons?
If there is no other reason than locking, there is no point in having this asynchronous job with postgresql.
= Proposal =
- create the same searchindex table in postgresql as in mysql
- modify the search queries to use this new table (reuse and adapt the mysql code)
- maintain searchindex synchronously with a trigger. I would think the best place to put this trigger would be on the page table. The trigger should update searchindex when page_latest is set to a non-zero value (I think the trigger will have everything it needs when it fires). A rough sketch of this is below.
- create a migration script for the database:
  - remove the unnecessary columns and indexes on the text table, and its trigger
  - remove titlevector from the page table, and its trigger
  - add the new searchindex table and put all page records in it, then create gin indexes (on title and on content)
  - put the trigger in place
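To make the trigger part concrete, here is a rough sketch of what I have in mind (the names are placeholders, and it assumes old_text holds plain readable text, which, as Brion pointed out, is not always true):

CREATE TABLE searchindex (
    si_page  integer NOT NULL PRIMARY KEY,
    si_title tsvector,
    si_text  tsvector
);
CREATE INDEX si_title_idx ON searchindex USING gin (si_title);
CREATE INDEX si_text_idx  ON searchindex USING gin (si_text);

CREATE FUNCTION page_update_searchindex() RETURNS trigger AS $$
BEGIN
    IF NEW.page_latest <> 0 THEN
        -- keep one row per page: drop the old entry, index the new latest text
        DELETE FROM searchindex WHERE si_page = NEW.page_id;
        INSERT INTO searchindex (si_page, si_title, si_text)
        SELECT NEW.page_id,
               to_tsvector('default', replace(NEW.page_title, '_', ' ')),
               to_tsvector('default', c.old_text)  -- assumes uncompressed, readable text
          FROM revision r
          JOIN pagecontent c ON c.old_id = r.rev_text_id
         WHERE r.rev_id = NEW.page_latest;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- fires on every insert/update of page; good enough for a sketch
CREATE TRIGGER page_searchindex
AFTER INSERT OR UPDATE ON page
FOR EACH ROW EXECUTE PROCEDURE page_update_searchindex();

The migration script would populate searchindex once with an INSERT ... SELECT over every page's latest revision, then install the trigger.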
Comments on this, please ?
Cheers
Marc
Marc Cousin wrote:
MySQL's way of doing text indexing:
- Article updates are done on text (pagecontent for postgresql) and on page (I of course omit the rest of the schema, as it's the same as far as full-text indexing is concerned).
- Sometimes, an indexing job locks searchindex and puts the new page contents into it (retrieving the contents from the text table). The new page contents are found thanks to recentchanges.
Well, no -- the search index is updated on page save, unless you go to a lot of (usually unnecessary) trouble to disable it and do bulk updates.
Here's what happens normally:
Article::editUpdates() puts a SearchUpdate object on the post-edit updates stack when the edit is complete. This gets run at the completion of the request, along with many other updates (links tables, view counts, etc).
SearchUpdate::doUpdate() takes the text, does some freaky markup normalization, and sends it off to the search engine's SearchEngine::update() or SearchEngine::updateTitle() methods.
The SearchEngine subclass for the relevant engine then updates its index for the given page. In the case of the MySQL backend, this creates or updates a row in the searchindex table.
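For illustration only, that row update amounts to something along these lines (placeholder values; the backend builds the real statement from the normalized title and text):

REPLACE INTO searchindex (si_page, si_title, si_text)
VALUES (1234, 'normalized page title', 'normalized page text ...');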
What I'd like to know is the reason for this asynchronous indexing in mysql : is it because of the myisam table for full text (to avoid locking during inserts), or for other performance reasons?
If you mean the non-default option to disable updates on edit, that would be because MyISAM updates cause locking trouble if your site is very busy, so you might prefer to do updates in bulk off-peak with the search temporarily disabled.
However this is not the default.
- maintain searchindex synchronously with a trigger. I would think the best place to put this trigger would be on the page table. The trigger should update searchindex when page_latest is set to a non-zero value (I think the trigger will have everything it needs when it fires).
Can't do it -- the database has insufficient knowledge to interpret the contents of page text in the general case.
* It doesn't know how to uncompress compressed revisions
* It doesn't know how to access text in external storage blobs
* It doesn't know the namespaces etc. for proper markup normalization
Since the database can't read page text as a general case, this has to be done from the application layer.
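To illustrate with the schema at hand: the old_flags column stored alongside old_text tells the application how to decode each row -- flags such as 'gzip', 'object' or 'external' mean the stored bytes are not wikitext that a trigger could feed to to_tsvector. A rough way to see how much of a given wiki is affected:

SELECT old_flags, count(*)
  FROM pagecontent
 GROUP BY old_flags;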
-- brion
On Monday 28 July 2008 20:39:46 Brion Vibber wrote:
Since the database can't read page text as a general case, this has to be done from the application layer.
Ok. That's not the case right now with the PostgreSQL code: the database maintains its text index by itself, via a trigger. So doing the changes the mysql way may correct other problems at the same time?
Marc Cousin wrote:
On Monday 28 July 2008 20:39:46 Brion Vibber wrote:
[snip]
Since the database can't read page text as a general case, this has to be done from the application layer.
Ok. That's not the case right now with the PostgreSQL code: the database maintains its text index by itself, via a trigger.
... which does not work as a general case as I've mentioned many times in this thread so far ...
So doing the changes the mysql way may correct other problems at the same time ?
There's nothing MySQL-specific about doing it in a way that works. :)
-- brion