[Mediawiki-l] Text search performance and postgresql

Brion Vibber brion at wikimedia.org
Wed Jul 30 17:04:02 UTC 2008


I should clarify that compression of individual revision text is only  
one of several DB-opaque ways text may be stored.

-- brion vibber (brion @ wikimedia.org)

On Jul 30, 2008, at 1:36, Marc Cousin <mcousin at sigma.fr> wrote:

> Thanks for your answers.
>
> Here's what I see from my (narrow) understanding of how MediaWiki works.
>
> - I already tried changing the parameters you mention, but since they were
> already correct for my (pretty old) machine, that gave me even worse results
> (as expected). The bad plans come from the fact that we have a lot of
> versions of page revisions indexed in the page table.
>
> Anyhow, I find it strange that we use the pagecontent table to store those
> full-text vectors when they are in fact the full-text vectors for a page.
> Wouldn't this textvector be better placed either in the page table or in a
> searchindex table with a 1:1 relationship, as in MySQL? That would be more
> logical for the planner: how could it guess that the indexed records in the
> pagecontent table are indeed the latest revisions from the page table?
> Whereas if the vector is in the page table, the query becomes a more
> classical join: we find the records matching the full-text criterion in the
> page table, then join to revision and pagecontent to retrieve the
> associated text.
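>
> Just to illustrate what I mean (a hypothetical sketch only -- suppose the
> vector lived in a page_textvector column on page, which is not the current
> schema):
>
> SELECT p.page_id, p.page_title, c.old_text
>   FROM page p
>   JOIN revision r    ON r.rev_id = p.page_latest    -- latest revision only
>   JOIN pagecontent c ON c.old_id = r.rev_text_id    -- fetch the stored text
>  WHERE p.page_textvector @@ to_tsquery('foobar');   -- hypothetical column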
>
> - All revisions of all pages were indexed in the text table. Maybe because
> we're still using 1.9.1? (There is no UPDATE in SearchPostgres.php here.)
>
>
> If I understand correctly what you explain below, the content of the text
> table is not 'opacified' for PostgreSQL, as there is no point in doing that
> given TOAST compression, so working with triggers is the logical way to go
> with PostgreSQL?
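>
> (By "triggers" I mean something along these lines -- a rough sketch only,
> assuming PostgreSQL 8.3's built-in text search rather than whatever the
> actual MediaWiki schema ships:)
>
> -- keep textvector in sync with old_text on every write (sketch, not the
> -- shipped schema)
> CREATE TRIGGER pagecontent_tsvector_update
>     BEFORE INSERT OR UPDATE ON pagecontent
>     FOR EACH ROW
>     EXECUTE PROCEDURE
>         tsvector_update_trigger(textvector, 'pg_catalog.english', old_text);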
>
> On Tuesday 29 July 2008 17:29:44 Greg Sabino Mullane wrote:
>> Marc Cousin wrote:
>>> The plan is this one
>>
>> [explain analyze plans]
>>
>> Those are pretty bad plans - you might benefit from some simple tuning.
>> Try lowering random_page_cost and boosting effective_cache_size. Or come
>> round to #postgresql on freenode for more free tuning help than you can
>> shake a stick at.
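>>
>> For instance, something like the following (ballpark values only -- the
>> right numbers depend on how much RAM the box has and how much of the data
>> fits in cache; try them per session first, then make the change permanent
>> in postgresql.conf):
>>
>> SET random_page_cost = 2.0;        -- default is 4.0; lower when reads mostly hit cache
>> SET effective_cache_size = '1GB';  -- roughly the memory the OS can use for caching;
>>                                    -- on pre-8.2 servers give a count of 8 kB pages instead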
>>
>> The query you gave ran in 56 ms for me on what I'm guessing is
>> a larger wiki (~ 400K rows in pagecontent) (and a pretty busy box).
>>
>> You should be seeing a Bitmap Index Scan on "ts2_page_text".
>>
>> If you find yourself using the same namespace restrictions a lot, you can
>> create some custom indexes as well, e.g.
>>
>> CREATE INDEX page_index1 ON page(page_title)
>>   WHERE page_is_redirect = 0 AND page_namespace IN (0,9,11);
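>>
>> A query that repeats that predicate can then use the partial index, e.g.
>>
>> SELECT page_id FROM page
>>  WHERE page_title = 'Foobar'
>>    AND page_is_redirect = 0 AND page_namespace IN (0,9,11);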
>>
>>> Is there a reason we index every version of every page's content?
>>
>> We don't. Or at least, we were not intending to. You can grep for UPDATE
>> in SearchPostgres.php. I just changed (r38184) the code to double-check we
>> nullify *all* old revisions: this may explain part of what you were seeing
>> before. You can check on individual pages and see if they are all
>> null-but-latest like so:
>>
>> SELECT CASE WHEN textvector IS NULL THEN 1 ELSE 0 END AS isnull, COUNT(*)
>>   FROM pagecontent
>>  WHERE old_id IN (SELECT rev_text_id FROM revision
>>                    WHERE rev_page = (SELECT page_id FROM page
>>                                       WHERE page_namespace = 0
>>                                         AND page_title = 'Foobar' LIMIT 1))
>>  GROUP BY 1;
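>> -- on a page with old revisions, a healthy result is two rows:
>> --   isnull = 0 with a count of 1 (the latest revision) and
>> --   isnull = 1 with the count of nullified old revisions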
>>
>> (mostly for the benefit of other people reading this thread, since  
>> you
>> said you already manually updated the database)
>>
>> Brion Vibber wrote:
>>> Hmm; a fundamental problem here is that text.old_text is opaque to  
>>> the
>>> database -- it may contain compressed text, text in an alternate
>>> encoding, or a reference to another table or database at the
>>> application level.
>>
>> Right: $wgCompressRevisions is not recommended for a Postgres  
>> backend.
>> Should probably document that somewhere. Postgres will compress
>> automatically anyway.
>>
>> Not to say that we might not want to change the way things are done, but
>> the easiest solution to the original poster's problem is to get the query
>> optimized first, and figure out why the old revisions are not being nulled.
>>
>> Platonides wrote:
>>> The postgresql backend is not too well maintained; people wanting to use
>>> postgres appear from time to time and give it a kick.
>>> I don't remember who is currently supposed to be taking care of it, but
>>> I interpret the fact that he hasn't commented on this as meaning he
>>> doesn't oppose ;)
>>
>> Wow, that's a bit harsh - not too well maintained? Hardly. :)  
>> Myself and
>> others have put in a lot of work to get it working and keep it so,
>> including paving the way for other database backends by discovering  
>> and
>> addressing lots of mysqlisms in the code. I apologize for not  
>> replying:
>> I somehow got recently unsubscribed.
>
>
>
> _______________________________________________
> MediaWiki-l mailing list
> MediaWiki-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l


