Marc Cousin wrote:
> MySQL's way of doing text indexation:
> - Article updates are done on text (pagecontent for PostgreSQL) and
> page (I of course omit the rest of the schema, as it's the same as far as
> full-text indexing is concerned).
> - Sometimes, an indexation job locks searchindex and puts
> the new page contents into it (retrieving the contents from the text
> table). New page contents are retrieved thanks to recentchanges.
Well, no -- the search index is updated on page save, unless you go to a
lot of (usually unnecessary) trouble to disable it and do bulk updates.
Here's what happens normally:
Article::editUpdates() puts a SearchUpdate object on the post-edit
updates stack when the edit is complete. This gets run at the completion
of the request, along with many other updates (links tables, view
counts, etc).
SearchUpdate::doUpdate() takes the text, does some freaky markup
normalization, and sends it off to the search engine's
SearchEngine::update() or SearchEngine::updateTitle() methods.
The SearchEngine subclass for the relevant engine then updates its index
for the given page. In the case of the MySQL backend, this creates or
updates a row in the searchindex table.
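To make the "creates or updates a row" step concrete, here is a rough sketch of what the MySQL backend's index write amounts to. The column names follow MediaWiki's searchindex schema as I recall it; treat the exact DDL as an assumption rather than the shipped schema.

```sql
-- MyISAM is required here: it is the only engine with FULLTEXT
-- support in the MySQL versions under discussion.
CREATE TABLE searchindex (
  si_page  INT UNSIGNED NOT NULL,              -- page_id of the indexed page
  si_title VARCHAR(255) NOT NULL DEFAULT '',   -- normalized title text
  si_text  MEDIUMTEXT NOT NULL                 -- normalized page text
) ENGINE=MyISAM;

CREATE UNIQUE INDEX si_page_idx ON searchindex (si_page);
CREATE FULLTEXT INDEX si_title_idx ON searchindex (si_title);
CREATE FULLTEXT INDEX si_text_idx ON searchindex (si_text);

-- "Creates or updates": REPLACE deletes any existing row for the
-- page and inserts the freshly normalized text in its place.
REPLACE INTO searchindex (si_page, si_title, si_text)
VALUES (1234, 'example title words', 'normalized article text ...');
```

The FULLTEXT writes are what cause the MyISAM table-level locking mentioned below when the site is busy.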
> What I'd like to know is the reason for this asynchronous indexing in
> MySQL: is it because of the MyISAM table for full text (to avoid locking
> during inserts), or for other performance reasons?
If you mean the non-default option to disable updates on edit, that
would be because MyISAM updates cause locking trouble if your site is
very busy, so you might prefer to do updates in bulk off-peak with the
search temporarily disabled.
However this is not the default.
> - Maintain searchindex synchronously with a trigger. I would think the
> best place to put this trigger would be on the page table. The trigger
> should update searchindex when page_latest is set to a non-zero value (I
> think the trigger will have everything it needs when it fires).
Can't do it -- the database has insufficient knowledge to interpret the
contents of page text in the general case.
* It doesn't know how to uncompress compressed revisions
* It doesn't know how to access text in external storage blobs
* It doesn't know the namespaces etc for proper markup normalization
Since the database can't read page text in the general case, this has to
be done from the application layer.
Ok. That's not the case right now with the PostgreSQL code: the database
maintains its text index by itself via a trigger. So doing the changes the
MySQL way may correct other problems at the same time?
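For reference, the trigger-maintained PostgreSQL index being described is roughly this shape. The names here are illustrative, not the exact MediaWiki Postgres schema, and a generic to_tsvector configuration stands in for whatever the real code uses:

```sql
-- A tsvector column on the text table holds the precomputed index.
ALTER TABLE pagecontent ADD COLUMN textvector tsvector;

-- GIN on newer PostgreSQL; GiST on older releases.
CREATE INDEX pagecontent_textvector_idx
  ON pagecontent USING gin (textvector);

-- The trigger recomputes the vector on every insert/update, which is
-- exactly the synchronous indexing discussed above.
CREATE FUNCTION pagecontent_tsv_update() RETURNS trigger AS $$
BEGIN
  NEW.textvector := to_tsvector('english', coalesce(NEW.old_text, ''));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER pagecontent_tsv_trigger
  BEFORE INSERT OR UPDATE ON pagecontent
  FOR EACH ROW EXECUTE PROCEDURE pagecontent_tsv_update();
```

Note that such a trigger sees only the raw bytes stored in the row, which is precisely the limitation raised above: it cannot uncompress compressed revisions, follow external storage pointers, or apply namespace-aware normalization.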