Thanks for your answers.
Here's what I see from my (narrow) understanding of how MediaWiki works.
- I already tried changing the parameters you mention, but as they were already correct for my (pretty old) machine, this brought even worse results (which was expected). The bad plans come from the fact that we have a lot of versions of page revisions indexed in the pagecontent table.
Anyhow, I find it strange that we use the pagecontent table to store those full text vectors when they are in fact the full text vectors for a page. Wouldn't this textvector be better placed either in the page table or in a searchindex table with a 1:1 relationship, like in MySQL? That would be more logical for the planner: how could it guess that the indexed records from the pagecontent table are indeed the latest revisions from the page table? Whereas if the vector is in the page table, the query becomes a more classical join: we find the records matching the full text criterion from the page table, then join to revision and pagecontent to retrieve the associated text.
- All revisions of all pages were indexed in the text table. Maybe because we're still using 1.9.1? (There is no UPDATE in SearchPostgres.php here.)

If I understand correctly what you explain below, the content of text is not 'opacified' for PostgreSQL, as there is no point in doing that because of TOAST compression, so working with triggers is the logical way to go with PostgreSQL?
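A hypothetical sketch of the "more classical join" described above, assuming the textvector column were moved into the page table (this is a proposed schema, not the one MediaWiki actually uses):

```sql
-- Hypothetical: filter on a tsvector stored directly in page, then join
-- out to revision/pagecontent only for the few matching rows.
SELECT p.page_title, pc.old_text
FROM page p
JOIN revision    r  ON r.rev_id  = p.page_latest
JOIN pagecontent pc ON pc.old_id = r.rev_text_id
WHERE p.textvector @@ to_tsquery('searchterm');
```

With the vector on page, the planner no longer has to guess which pagecontent rows are current revisions; the restriction and the join order follow the usual statistics.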
On Tuesday 29 July 2008 17:29:44 Greg Sabino Mullane wrote:
Marc Cousin wrote:
The plan is this one
[explain analyze plans]
Those are pretty bad plans - you might benefit from some simple tuning. Try lowering random_page_cost and boosting effective_cache_size. Or come round to #postgresql on freenode, for more free tuning help than you can shake a stick at.
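As a hedged illustration of the advice above - the values below are guesses, and the right numbers depend on the machine's RAM and disk - both settings can be tried per-session before touching postgresql.conf:

```sql
-- Illustrative values only, not recommendations for any particular box.
SET random_page_cost = 2.0;         -- default at the time was 4.0
SET effective_cache_size = '512MB'; -- rough estimate of RAM used for caching

-- Then re-run the problem query with EXPLAIN ANALYZE to see whether
-- the plan improves before making the change permanent.
```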
The query you gave ran in 56 ms for me on what I'm guessing is a larger wiki (~400K rows in pagecontent) (and a pretty busy box). You should be seeing a Bitmap Index Scan on "ts2_page_text".
If you find yourself using the same namespace restrictions a lot, you can create some custom indexes as well, e.g.

CREATE INDEX page_index1 ON page(page_title)
  WHERE page_is_redirect = 0 AND page_namespace IN (0,9,11);
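A sketch of the kind of query such a partial index could serve (the title value is a placeholder):

```sql
-- The planner can use page_index1 only when the query's WHERE clause
-- implies the index predicate (same redirect and namespace restrictions).
SELECT page_id, page_title
FROM page
WHERE page_is_redirect = 0
  AND page_namespace IN (0,9,11)
  AND page_title = 'Some_title';
```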
Is there a reason we index every version of every content?
We don't. Or at least, we were not intending to. You can grep for UPDATE in SearchPostgres.php. I just changed (r38184) the code to double-check we nullify *all* old revisions: this may explain part of what you were seeing before. You can check on individual pages and see if they are all null-but-latest like so:

SELECT CASE WHEN textvector IS NULL THEN 1 ELSE 0 END AS isnull, COUNT(*)
FROM pagecontent
WHERE old_id IN
  (SELECT rev_text_id FROM revision
   WHERE rev_page =
     (SELECT page_id FROM page
      WHERE page_namespace = 0 AND page_title = 'Foobar' LIMIT 1))
GROUP BY 1;

(Mostly for the benefit of other people reading this thread, since you said you already manually updated the database.)
Brion Vibber wrote:
Hmm; a fundamental problem here is that
text.old_text is opaque to
the
database -- it may contain compressed text, text in an alternate
encoding, or a reference to another table or database at the
application level.
Right: $wgCompressRevisions is not recommended for a Postgres backend. Should probably document that somewhere. Postgres will compress automatically anyway.
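A hedged way to see that automatic (TOAST) compression is in fact happening, using standard Postgres size functions - the table and column names assume the MediaWiki Postgres schema discussed in this thread:

```sql
-- pg_column_size() reports the stored (possibly TOAST-compressed) size,
-- while octet_length() reports the full uncompressed byte length;
-- a stored size well below the raw size indicates compression kicked in.
SELECT old_id,
       pg_column_size(old_text) AS stored_bytes,
       octet_length(old_text)   AS raw_bytes
FROM pagecontent
ORDER BY old_id DESC
LIMIT 5;
```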
Not to say that we might not want to change the way things are done, but the easiest solution to the original poster's problem is to get the query optimized first, and figure out why the old revisions are not being nulled.
Platonides wrote:
The postgresql backend is not too well
maintained, people wanting to
postgre appear from time to time and give it a kick.
I don't remember who is currently supposed to be taking care of
it, but
i interpret that as he hasn't commented on this, he doesn't
oppose ;)
Wow, that's a bit harsh - not too well maintained? Hardly. :) Myself and others have put in a lot of work to get it working and keep it so, including paving the way for other database backends by discovering and addressing lots of mysqlisms in the code. I apologize for not replying: I somehow got recently unsubscribed.
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l(a)lists.wikimedia.org