Hi,
I'm new to this list, and I'm not sure I'm posting in the right place. I have a question about text indexing.
I have a postgresql mediawiki instance installed on a very small server, and I have been working on tweaking its performance these last few days. Once everything that could be done with the OS and database parameters was done, I decided to take a look at the SQL queries.
The main problem I'm seeing is with the text searches. I don't really know how it's handled with mysql, but with postgresql I think there is some optimization that could be done, though I'm not sure about it as I don't know all the code.
The text search query is this one:
EXPLAIN ANALYZE
SELECT page_id, page_namespace, page_title, old_text AS page_text,
       ts_rank(titlevector, to_tsquery('default','postgres')) AS rnk
  FROM page p, revision r, pagecontent c
 WHERE p.page_latest = r.rev_id
   AND r.rev_text_id = c.old_id
   AND textvector @@ to_tsquery('default','postgres')
   AND page_is_redirect = '0'
   AND page_namespace IN (0,9,11)
 ORDER BY rnk DESC, page_id DESC
 LIMIT 20 OFFSET 0;
The plan is this one:
Limit  (cost=718.49..718.50 rows=1 width=621) (actual time=305.943..305.982 rows=20 loops=1)
  ->  Sort  (cost=718.49..718.50 rows=1 width=621) (actual time=305.939..305.952 rows=20 loops=1)
        Sort Key: rank(p.titlevector, '''postgr'''::tsquery), p.page_id
        ->  Nested Loop  (cost=0.00..718.48 rows=1 width=621) (actual time=4.278..305.671 rows=44 loops=1)
              ->  Nested Loop  (cost=0.00..695.00 rows=21 width=204) (actual time=0.829..76.740 rows=3210 loops=1)
                    ->  Seq Scan on page p  (cost=0.00..524.95 rows=21 width=204) (actual time=0.804..19.686 rows=3210 loops=1)
                          Filter: (((page_is_redirect)::text = '0'::text) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
                    ->  Index Scan using revision_rev_id_key on revision r  (cost=0.00..8.09 rows=1 width=8) (actual time=0.012..0.013 rows=1 loops=3210)
                          Index Cond: (p.page_latest = r.rev_id)
              ->  Index Scan using pagecontent_pkey on pagecontent c  (cost=0.00..1.11 rows=1 width=425) (actual time=0.069..0.069 rows=0 loops=3210)
                    Index Cond: (r.rev_text_id = c.old_id)
                    Filter: (textvector @@ '''postgr'''::tsquery)
Total runtime: 306.118 ms
This plan joins page and revision to find the latest revision of each page, then checks each corresponding pagecontent row to determine which ones match my query.
There is also another plan, depending on the amount of RAM available and on the estimated number of 'latest' pagecontent rows:
Limit  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.594..224.646 rows=20 loops=1)
  ->  Sort  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.591..224.610 rows=20 loops=1)
        Sort Key: (ts_rank(p.titlevector, '''postgr'''::tsquery)), p.page_id
        Sort Method: top-N heapsort  Memory: 37kB
        ->  Hash Join  (cost=2689.31..2979.45 rows=4 width=504) (actual time=211.141..224.432 rows=43 loops=1)
              Hash Cond: (p.page_latest = r.rev_id)
              ->  Seq Scan on page p  (cost=0.00..276.86 rows=3527 width=82) (actual time=0.460..10.202 rows=3118 loops=1)
                    Filter: ((page_is_redirect = '0'::bpchar) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
              ->  Hash  (cost=2688.26..2688.26 rows=84 width=430) (actual time=210.409..210.409 rows=1517 loops=1)
                    ->  Hash Join  (cost=534.76..2688.26 rows=84 width=430) (actual time=26.557..207.725 rows=1517 loops=1)
                          Hash Cond: (r.rev_text_id = c.old_id)
                          ->  Seq Scan on revision r  (cost=0.00..1836.94 rows=84194 width=8) (actual time=0.023..98.850 rows=84194 loops=1)
                          ->  Hash  (cost=533.59..533.59 rows=93 width=430) (actual time=18.182..18.182 rows=1515 loops=1)
                                ->  Bitmap Heap Scan on pagecontent c  (cost=190.83..533.59 rows=93 width=430) (actual time=0.585..15.663 rows=1515 loops=1)
                                      Recheck Cond: (textvector @@ '''postgr'''::tsquery)
                                      ->  Bitmap Index Scan on ts2_page_text2  (cost=0.00..190.81 rows=93 width=0) (actual time=0.431..0.431 rows=1515 loops=1)
                                            Index Cond: (textvector @@ '''postgr'''::tsquery)
Total runtime: 224.765 ms
Times are different because this machine is much more powerful.
This time, postgresql decides to fetch all the matching rows from pagecontent, across all versions, and then determines which ones are the latest.
In both cases this is rather inefficient, since I assume we only want to search the latest version of each article.
So I'm coming to the point ...
Is there a reason why we index every version of every page's content?
For instance, with my database, I've tested removing all textvectors from pagecontent except for the latest version of each page. My text index size went from 400 MB to 15 MB, and my text search times went down to a near-constant 10 ms for all queries. I can then maintain the textvectors by modifying the trigger on pagecontent a bit, so that it cleans up the previous record while updating the table.
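To give an idea, the one-off cleanup I tested boils down to something like this (just a sketch, using the column names that appear in the plans above):

UPDATE pagecontent
   SET textvector = NULL
 WHERE old_id NOT IN (SELECT r.rev_text_id
                        FROM page p
                        JOIN revision r ON r.rev_id = p.page_latest);

probably followed by a REINDEX of the gin index on textvector (ts2_page_text2 here) so that the index really shrinks on disk.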
If I'm posting in the wrong place, please tell me. If the idea is stupid, please tell me also :)
Cheers
Marc Cousin
On Jul 18, 2008, at 16:03, Marc Cousin <mcousin@sigma.fr> wrote:
EXPLAIN ANALYZE
SELECT page_id, page_namespace, page_title, old_text AS page_text,
       ts_rank(titlevector, to_tsquery('default','postgres')) AS rnk
  FROM page p, revision r, pagecontent c
 WHERE p.page_latest = r.rev_id
   AND r.rev_text_id = c.old_id
   AND textvector @@ to_tsquery('default','postgres')
   AND page_is_redirect = '0'
   AND page_namespace IN (0,9,11)
 ORDER BY rnk DESC, page_id DESC
 LIMIT 20 OFFSET 0;
Hmm; a fundamental problem here is that text.old_text is opaque to the database -- it may contain compressed text, text in an alternate encoding, or a reference to another table or database at the application level.
This is part of why we use a separate table for the search index for the MySQL search -- it's actually consistently readable to the db's search engine as well as providing a summary table so the index only has to cover live versions.
-- brion
On Saturday 19 July 2008 15:44:51 Brion Vibber wrote:
Hmm; a fundamental problem here is that text.old_text is opaque to the database -- it may contain compressed text, text in an alternate encoding, or a reference to another table or database at the application level.
Hi,
Thanks for the answer...
What do you mean by 'opaque'?
By the way, the separate table for mysql solves the problem I'm having with postgresql (except that postgresql doesn't have the myisam problem, so the fulltext index could be stored directly in the 'page' table). For a wiki like mine, with a lot of versions per document, text search performance becomes very bad.
So can you explain to me what you mean by opaque?
Thanks a lot for your time.
Marc
By 'opaque' he means that the databases are supposed to handle that field transparently. They are supposed to send the data raw to the app and act as if they have absolutely no clue how to handle the data inside of that field.
The reason is that the data inside that column is inconsistent, and letting the database modify it, or pretend it knows what is inside, can end up with really unexpected results.
Ok, I now understand the meaning of opaque, even if that data isn't quite so opaque, since the database and mediawiki are already using it for full-text indexing.
I totally agree with you that it would be a really bad idea to try to mess with the content of the page (that wasn't what I was talking about).
Currently, for postgresql, there is a trigger updating the 'textvector' column in the pagecontent table when 'old_text' is modified, so that the text search index stays up to date.
The problem is that the textvector should only exist for the latest version of each page's content. There is no point in indexing the old versions of a page's contents.
That's done for mysql because of the 1:1 relationship between page and searchindex, but isn't done for postgresql.
For PostgreSQL, there are at least two solutions:
- Put the textvector into the page table
- Do the same as for mysql: create a searchindex table (even if the myisam justification for this table doesn't hold for postgresql)
Plus there is the temporary fix I was talking about in my mail: put an empty textvector in all versions of a document except the latest one, by changing the trigger a bit. That divided the size of my fulltext index by 30 and made the text search fast again.
What would be the preferred way to solve this problem?
On Monday 21 July 2008 19:47:09 Daniel Friesen wrote:
By 'opaque' he means that the databases are supposed to handle that field transparently. They are supposed to send the data raw to the app and act as if they have absolutely no clue how to handle the data inside of that field.
The reason is that the data inside that column is inconsistent, and letting the database modify it, or pretend it knows what is inside, can end up with really unexpected results.
Marc Cousin wrote:
For PostgreSQL, there are at least two solutions:
- Put the textvector into the page table
- Do the same as for mysql: create a searchindex table (even if the myisam justification for this table doesn't hold for postgresql)
Plus there is the temporary fix I was talking about in my mail: put an empty textvector in all versions of a document except the latest one, by changing the trigger a bit. That divided the size of my fulltext index by 30 and made the text search fast again.
What would be the preferred way to solve this problem?
Probably to create a new table. The less you differ from the mysql setup, the better.
I totally agree with you.
Is there a way of getting this into a future mediawiki release (I can do the work if necessary)?
On Friday 25 July 2008 17:05:48 Platonides wrote:
Probably to create a new table. The less you differ from the mysql setup, the better.
Marc Cousin wrote:
I totally agree with you.
Is there a way of getting this into a future mediawiki release (I can do the work if necessary)?
Sure. Ask brion for an SVN account (you'll need to send him a public key). The postgresql backend is not too well maintained; people wanting postgres support appear from time to time and give it a kick. I don't remember who is currently supposed to be taking care of it, but I interpret the fact that he hasn't commented on this as meaning he doesn't oppose ;)
On Monday 28 July 2008 13:34:41 Platonides wrote:
Sure. Ask brion for an SVN account (you'll need to send him a public key). The postgresql backend is not too well maintained; people wanting postgres support appear from time to time and give it a kick. I don't remember who is currently supposed to be taking care of it, but I interpret the fact that he hasn't commented on this as meaning he doesn't oppose ;)
Okay.
First, I'd like to sum up what needs to be done (please, anybody, tell me if I'm wrong; I'll let a few days pass before I start working).
= Current state =
MySQL's way of doing text indexing:
- Article updates are done on text (pagecontent for postgresql) and on page (I of course omit the rest of the schema, as it's the same as far as full-text indexing is concerned).
- Sometimes, an indexing job locks searchindex and puts the new page contents into it (retrieving the contents from the text table). The new page contents are found thanks to recentchanges.
For PostgreSQL, right now, the full-text data is stored in the pagecontent (text) table, so there are many unnecessary versions of the full-text data.
What I'd like to know is the reason for this asynchronous indexing in mysql : is it because of the myisam table for full text (to avoid locking during inserts), or for other performance reasons?
If there is no other reason than locking, there is no point in having this asynchronous job with postgresql.
= Proposal =
- create the same searchindex table in postgresql as in mysql
- modify the search queries to use this new table (reuse and adapt the mysql code)
- maintain searchindex synchronously with a trigger. I would think the best place to put this trigger would be on the page table. The trigger should update searchindex when page_latest is set to a non-zero value (I think the trigger will have everything it needs when it fires). A rough sketch of this is below.
- create a migration script for the database:
  - remove the unnecessary columns and indexes on the text table, and its trigger
  - remove titlevector from the page table, and its trigger
  - add the new searchindex table and put all page records in it, then create gin indexes (on title and on content)
  - put the trigger in place
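To make the trigger part concrete, here is a rough sketch of what I have in mind (the names are placeholders, and it assumes old_text holds plain readable text, which, as Brion pointed out, is not always true):

CREATE TABLE searchindex (
    si_page  integer NOT NULL PRIMARY KEY,
    si_title tsvector,
    si_text  tsvector
);
CREATE INDEX si_title_idx ON searchindex USING gin (si_title);
CREATE INDEX si_text_idx  ON searchindex USING gin (si_text);

CREATE FUNCTION page_update_searchindex() RETURNS trigger AS $$
BEGIN
    IF NEW.page_latest <> 0 THEN
        -- keep one row per page: drop the old entry, index the new latest text
        DELETE FROM searchindex WHERE si_page = NEW.page_id;
        INSERT INTO searchindex (si_page, si_title, si_text)
        SELECT NEW.page_id,
               to_tsvector('default', replace(NEW.page_title, '_', ' ')),
               to_tsvector('default', c.old_text)  -- assumes uncompressed, readable text
          FROM revision r
          JOIN pagecontent c ON c.old_id = r.rev_text_id
         WHERE r.rev_id = NEW.page_latest;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- fires on every insert/update of page; good enough for a sketch
CREATE TRIGGER page_searchindex
AFTER INSERT OR UPDATE ON page
FOR EACH ROW EXECUTE PROCEDURE page_update_searchindex();

The migration script would populate searchindex once with an INSERT ... SELECT over every page's latest revision, then install the trigger.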
Comments on this, please ?
Cheers
Marc
Marc Cousin wrote:
MySQL's way of doing text indexing:
- Article updates are done on text (pagecontent for postgresql) and on page (I of course omit the rest of the schema, as it's the same as far as full-text indexing is concerned).
- Sometimes, an indexing job locks searchindex and puts the new page contents into it (retrieving the contents from the text table). The new page contents are found thanks to recentchanges.
Well, no -- the search index is updated on page save, unless you go to a lot of (usually unnecessary) trouble to disable it and do bulk updates.
Here's what happens normally:
Article::editUpdates() puts a SearchUpdate object on the post-edit updates stack when the edit is complete. This gets run at the completion of the request, along with many other updates (links tables, view counts, etc).
SearchUpdate::doUpdate() takes the text, does some freaky markup normalization, and sends it off to the search engine's SearchEngine::update() or SearchEngine::updateTitle() methods.
The SearchEngine subclass for the relevant engine then updates its index for the given page. In the case of the MySQL backend, this creates or updates a row in the searchindex table.
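For illustration only, that row update amounts to something along these lines (placeholder values; the backend builds the real statement from the normalized title and text):

REPLACE INTO searchindex (si_page, si_title, si_text)
VALUES (1234, 'normalized page title', 'normalized page text ...');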
What I'd like to know is the reason for this asynchronous indexing in mysql : is it because of the myisam table for full text (to avoid locking during inserts), or for other performance reasons?
If you mean the non-default option to disable updates on edit, that would be because MyISAM updates cause locking trouble if your site is very busy, so you might prefer to do updates in bulk off-peak with the search temporarily disabled.
However this is not the default.
- maintain searchindex synchronously with a trigger. I would think the best place to put this trigger would be on the page table. The trigger should update searchindex when page_latest is set to a non-zero value (I think the trigger will have everything it needs when it fires).
Can't do it -- the database has insufficient knowledge to interpret the contents of page text in the general case.
* It doesn't know how to uncompress compressed revisions
* It doesn't know how to access text in external storage blobs
* It doesn't know the namespaces etc. for proper markup normalization
Since the database can't read page text as a general case, this has to be done from the application layer.
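To illustrate with the schema at hand: the old_flags column stored alongside old_text tells the application how to decode each row -- flags such as 'gzip', 'object' or 'external' mean the stored bytes are not wikitext that a trigger could feed to to_tsvector. A rough way to see how much of a given wiki is affected:

SELECT old_flags, count(*)
  FROM pagecontent
 GROUP BY old_flags;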
-- brion
On Monday 28 July 2008 20:39:46 Brion Vibber wrote:
Since the database can't read page text as a general case, this has to be done from the application layer.
Ok. That's not the case right now with the PostgreSQL code: the database maintains its text index by itself, via a trigger. So doing the changes the mysql way may correct other problems at the same time?
Marc Cousin wrote:
On Monday 28 July 2008 20:39:46 Brion Vibber wrote:
[snip]
Since the database can't read page text as a general case, this has to be done from the application layer.
Ok. That's not the case right now with the PostgreSQL code: the database maintains its text index by itself, via a trigger.
... which does not work as a general case as I've mentioned many times in this thread so far ...
So doing the changes the mysql way may correct other problems at the same time ?
There's nothing MySQL-specific about doing it in a way that works. :)
-- brion