On Jul 18, 2008, at 16:03, Marc Cousin <mcousin(a)sigma.fr> wrote:
EXPLAIN ANALYZE
SELECT page_id, page_namespace, page_title, old_text AS page_text,
       ts_rank(titlevector, to_tsquery('default', 'postgres')) AS rnk
FROM page p, revision r, pagecontent c
WHERE p.page_latest = r.rev_id
  AND r.rev_text_id = c.old_id
  AND textvector @@ to_tsquery('default', 'postgres')
  AND page_is_redirect = '0'
  AND page_namespace IN (0, 9, 11)
ORDER BY rnk DESC, page_id DESC
LIMIT 20 OFFSET 0;
Hmm; a fundamental problem here is that text.old_text is opaque to the
database -- it may contain compressed text, text in an alternate
encoding, or a reference to another table or database at the
application level.
This is part of why we use a separate table for the search index with
the MySQL search -- it's consistently readable by the database's
search engine, and it doubles as a summary table, so the index only
has to cover live versions.
-- brion
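
(For illustration: a hypothetical PostgreSQL analogue of the separate
summary table Brion describes might look like the sketch below. The
names are made up, not MediaWiki's actual searchindex schema; the
point is that the full-text index then only ever covers live
revisions.)

CREATE TABLE search_summary (
    page_id    integer PRIMARY KEY REFERENCES page (page_id),
    textvector tsvector NOT NULL
);

CREATE INDEX search_summary_textvector_idx
    ON search_summary USING gin (textvector);

-- On page save, the application (or a trigger) replaces that page's
-- row, so the table always holds exactly one searchable row per live
-- page.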
The plan is this one:
Limit  (cost=718.49..718.50 rows=1 width=621) (actual time=305.943..305.982 rows=20 loops=1)
  ->  Sort  (cost=718.49..718.50 rows=1 width=621) (actual time=305.939..305.952 rows=20 loops=1)
        Sort Key: rank(p.titlevector, '''postgr'''::tsquery), p.page_id
        ->  Nested Loop  (cost=0.00..718.48 rows=1 width=621) (actual time=4.278..305.671 rows=44 loops=1)
              ->  Nested Loop  (cost=0.00..695.00 rows=21 width=204) (actual time=0.829..76.740 rows=3210 loops=1)
                    ->  Seq Scan on page p  (cost=0.00..524.95 rows=21 width=204) (actual time=0.804..19.686 rows=3210 loops=1)
                          Filter: (((page_is_redirect)::text = '0'::text) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
                    ->  Index Scan using revision_rev_id_key on revision r  (cost=0.00..8.09 rows=1 width=8) (actual time=0.012..0.013 rows=1 loops=3210)
                          Index Cond: (p.page_latest = r.rev_id)
              ->  Index Scan using pagecontent_pkey on pagecontent c  (cost=0.00..1.11 rows=1 width=425) (actual time=0.069..0.069 rows=0 loops=3210)
                    Index Cond: (r.rev_text_id = c.old_id)
                    Filter: (textvector @@ '''postgr'''::tsquery)
Total runtime: 306.118 ms
This plan joins page and revision to find the latest revision of each
page, then scans every matching pagecontent row to determine which
ones match my query.
There is also another plan, depending on the amount of RAM available
and on the estimated number of 'latest' pagecontent rows (see the
work_mem side note after this plan):
Limit  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.594..224.646 rows=20 loops=1)
  ->  Sort  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.591..224.610 rows=20 loops=1)
        Sort Key: (ts_rank(p.titlevector, '''postgr'''::tsquery)), p.page_id
        Sort Method:  top-N heapsort  Memory: 37kB
        ->  Hash Join  (cost=2689.31..2979.45 rows=4 width=504) (actual time=211.141..224.432 rows=43 loops=1)
              Hash Cond: (p.page_latest = r.rev_id)
              ->  Seq Scan on page p  (cost=0.00..276.86 rows=3527 width=82) (actual time=0.460..10.202 rows=3118 loops=1)
                    Filter: ((page_is_redirect = '0'::bpchar) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
              ->  Hash  (cost=2688.26..2688.26 rows=84 width=430) (actual time=210.409..210.409 rows=1517 loops=1)
                    ->  Hash Join  (cost=534.76..2688.26 rows=84 width=430) (actual time=26.557..207.725 rows=1517 loops=1)
                          Hash Cond: (r.rev_text_id = c.old_id)
                          ->  Seq Scan on revision r  (cost=0.00..1836.94 rows=84194 width=8) (actual time=0.023..98.850 rows=84194 loops=1)
                          ->  Hash  (cost=533.59..533.59 rows=93 width=430) (actual time=18.182..18.182 rows=1515 loops=1)
                                ->  Bitmap Heap Scan on pagecontent c  (cost=190.83..533.59 rows=93 width=430) (actual time=0.585..15.663 rows=1515 loops=1)
                                      Recheck Cond: (textvector @@ '''postgr'''::tsquery)
                                      ->  Bitmap Index Scan on ts2_page_text2  (cost=0.00..190.81 rows=93 width=0) (actual time=0.431..0.431 rows=1515 loops=1)
                                            Index Cond: (textvector @@ '''postgr'''::tsquery)
Total runtime: 224.765 ms
The times differ because this machine is much more powerful. This
time, PostgreSQL decides to fetch all matching articles from
pagecontent, across all versions, and only then determines which of
them are the latest.
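
(Side note, illustrative only: which of these two plans gets chosen
can also be nudged per session, since the hash join becomes cheaper
when more sort/hash memory is available. The value below is made up,
not a recommendation:)

SET work_mem = '64MB';            -- illustrative value
EXPLAIN ANALYZE SELECT ... ;      -- same query as above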
In both cases this is rather inefficient, since, as I understand it,
we only ever search the latest version of each article.
So I'm coming to the point: is there a reason we index every version
of every page's content?
For instance, on my database I've tested removing all textvectors
from pagecontent except those for the latest version of each page. My
text index size went from 400 MB to 15 MB, and my text search times
dropped to a near-constant 10 ms for all queries. The textvectors can
then be maintained by slightly modifying the trigger on pagecontent
so that it cleans up the previous record while updating the table.
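
(A rough sketch of one way to do that cleanup -- here as a trigger on
page rather than pagecontent, with made-up names; it assumes the
table and column relationships from the query above and is not
MediaWiki's actual tsearch2 trigger:)

CREATE OR REPLACE FUNCTION page_textvector_cleanup() RETURNS trigger AS $$
BEGIN
    -- When a page points to a new latest revision, blank out the
    -- tsvector of the revision it superseded, so the full-text index
    -- only covers live text.
    IF OLD.page_latest IS DISTINCT FROM NEW.page_latest THEN
        UPDATE pagecontent SET textvector = NULL
         WHERE old_id = (SELECT rev_text_id FROM revision
                          WHERE rev_id = OLD.page_latest);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER page_textvector_cleanup
    AFTER UPDATE ON page
    FOR EACH ROW EXECUTE PROCEDURE page_textvector_cleanup();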
If I'm posting in the wrong place, please tell me. If the idea is
stupid, please tell me also :)
Cheers
Marc Cousin