Re: [Mediawiki-l] Text search performance and postgresql

28 Jul 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Marc Cousin wrote:
...
  MySQL's way of doing text indexation:
 - Articles updates are done on text (pagecontent for postgresql), and page (I 
 of course omit the rest of the schema as it's the same as far as full text 
 indexing is concerned)
 - Sometimes, there is an indexation work that locks searchindex and puts the 
 new pages contents into it (retrieving the contents from the text table). New 
 page contents is retrieved thanks to recentchanges. 
Well, no -- the search index is updated on page save, unless you go to a
lot of (usually unnecessary) trouble to disable it and do bulk updates.

Here's what happens normally:

Article::editUpdates() puts a SearchUpdate object on the post-edit
updates stack when the edit is complete. This gets run at the completion
of the request, along with many other updates (links tables, view
counts, etc).

SearchUpdate::doUpdate() takes the text, does some freaky markup
normalization, and sends it off to the search engine's
SearchEngine::update() or SeachEngine::updateTitle() methods.

The SearchEngine subclass for the relevant engine then updates its index
for the given page. In the case of the MySQL backend, this creates or
updates a row in the searchindex table.

...
  What I'd like to know is the reason for this
asynchronous indexing in mysql : 
 is it because of the myisam table for full text (to avoid locking during 
 inserts), or for other performance reasons? 
If you mean the non-default option to disable updates on edit, that
would be because MyISAM updates cause locking trouble if your site is
very busy, so you might prefer to do updates in bulk off-peak with the
search temporarily disabled.

However this is not the default.

...
  - maintain searchindex synchronously with a trigger. I
would think the best 
 place to put this trigger would be on the page table. The trigger should 
 update searchindex when page_latest is put to a non zero value (I think the 
 trigger will have everything it needs when this trigger fires). 
Can't do it -- the database has insufficient knowledge to interpret the
contents of page text in the general case.

* It doesn't know how to uncompress compressed revisions
* It doesn't know how to access text in external storage blobs
* It doesn't know the namespaces etc for proper markup normalization

Since the database can't read page text as a general case, this has to
be done from the application layer.

- -- brion
...PGP SIGNATURE...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkiOEnIACgkQwRnhpk1wk47lnACg2SApw9M6bxbcwsS9IhzFC/3Z
I1gAn3ZWFwnABt82JvNkJHQiax7RNRXW
=ES9a
-----END PGP SIGNATURE----- 

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

Re: [Mediawiki-l] Text search performance and postgresql