Marc Cousin wrote:
> MySQL's way of doing text indexation:
> - Article updates are done on text (pagecontent for PostgreSQL) and
> page (I of course omit the rest of the schema, as it's the same as far as
> full-text indexing is concerned).
> - Sometimes, an indexation job locks searchindex and puts
> the new page contents into it (retrieving the contents from the text
> table). New page contents are retrieved thanks to recentchanges.
Well, no -- the search index is updated on page save, unless you go to a
lot of (usually unnecessary) trouble to disable it and do bulk updates.
Here's what happens normally:
Article::editUpdates() puts a SearchUpdate object on the post-edit
updates stack when the edit is complete. This gets run at the completion
of the request, along with many other updates (links tables, view
counts, etc).
SearchUpdate::doUpdate() takes the text, does some freaky markup
normalization, and sends it off to the search engine's
SearchEngine::update() or SearchEngine::updateTitle() methods.
The SearchEngine subclass for the relevant engine then updates its index
for the given page. In the case of the MySQL backend, this creates or
updates a row in the searchindex table.
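To make the "creates or updates a row" step concrete, here is a rough sketch of what the MySQL backend's index write amounts to. The column names follow MediaWiki's searchindex schema as I recall it; treat the exact DDL as an assumption rather than the shipped schema.

```sql
-- MyISAM is required here: it is the only engine with FULLTEXT
-- support in the MySQL versions under discussion.
CREATE TABLE searchindex (
  si_page  INT UNSIGNED NOT NULL,              -- page_id of the indexed page
  si_title VARCHAR(255) NOT NULL DEFAULT '',   -- normalized title text
  si_text  MEDIUMTEXT NOT NULL                 -- normalized page text
) ENGINE=MyISAM;

CREATE UNIQUE INDEX si_page_idx ON searchindex (si_page);
CREATE FULLTEXT INDEX si_title_idx ON searchindex (si_title);
CREATE FULLTEXT INDEX si_text_idx ON searchindex (si_text);

-- "Creates or updates": REPLACE deletes any existing row for the
-- page and inserts the freshly normalized text in its place.
REPLACE INTO searchindex (si_page, si_title, si_text)
VALUES (1234, 'example title words', 'normalized article text ...');
```

The FULLTEXT writes are what cause the MyISAM table-level locking mentioned below when the site is busy.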
> What I'd like to know is the reason for this asynchronous indexing in
> MySQL: is it because of the MyISAM table for full text (to avoid locking
> during inserts), or for other performance reasons?
If you mean the non-default option to disable updates on edit, that
would be because MyISAM updates cause locking trouble if your site is
very busy, so you might prefer to do updates in bulk off-peak with the
search temporarily disabled.
However this is not the default.
> - Maintain searchindex synchronously with a trigger. I would think the
> best place to put this trigger would be on the page table. The trigger
> should update searchindex when page_latest is set to a non-zero value (I
> think the trigger will have everything it needs when it fires).
Can't do it -- the database has insufficient knowledge to interpret the
contents of page text in the general case.
* It doesn't know how to uncompress compressed revisions
* It doesn't know how to access text in external storage blobs
* It doesn't know the namespaces etc for proper markup normalization
Since the database can't read page text in the general case, this has to
be done from the application layer.
Ok. That's not the case right now with the PostgreSQL code: the database
maintains its text index by itself via a trigger. So doing the changes the
MySQL way may correct other problems at the same time?
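For reference, the trigger-maintained PostgreSQL index being described is roughly this shape. The names here are illustrative, not the exact MediaWiki Postgres schema, and a generic to_tsvector configuration stands in for whatever the real code uses:

```sql
-- A tsvector column on the text table holds the precomputed index.
ALTER TABLE pagecontent ADD COLUMN textvector tsvector;

-- GIN on newer PostgreSQL; GiST on older releases.
CREATE INDEX pagecontent_textvector_idx
  ON pagecontent USING gin (textvector);

-- The trigger recomputes the vector on every insert/update, which is
-- exactly the synchronous indexing discussed above.
CREATE FUNCTION pagecontent_tsv_update() RETURNS trigger AS $$
BEGIN
  NEW.textvector := to_tsvector('english', coalesce(NEW.old_text, ''));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER pagecontent_tsv_trigger
  BEFORE INSERT OR UPDATE ON pagecontent
  FOR EACH ROW EXECUTE PROCEDURE pagecontent_tsv_update();
```

Note that such a trigger sees only the raw bytes stored in the row, which is precisely the limitation raised above: it cannot uncompress compressed revisions, follow external storage pointers, or apply namespace-aware normalization.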