On 4/5/07, Domas Mituzas <midom.lists(a)gmail.com> wrote:
haha, yeah, thats how some fuzzy search methods work.
Have you tried
performance of such an index - would be interesting to find out it
could be viable solution :)
of course, it would be better to put 'png' and 'jpg' into stopwords
(ergh, or strip file extensions entirely).
Yes, I know. I carry a trigram index on page_titles in my local
database, for fuzzy union based title searches. It's very fast when
storing the trigrams in an inverted index.
wikidb=> CREATE INDEX page_page_title_gintgm ON page USING gin
(page_title gin_trgm_ops) where page_namespace=0;
CREATE INDEX
wikidb=> explain ANALYZE select page_title from page where page_title
% 'evil' and page_namespace=0;
QUERY
PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on page (cost=109.19..8578.91 rows=2491 width=21)
(actual time=142.416..370.970 rows=262 loops=1)
Recheck Cond: (page_namespace = 0)
Filter: (page_title % 'evil'::pg_catalog.text)
-> Bitmap Index Scan on page_page_title_gintgm (cost=0.00..108.56
rows=2491 width=0) (actual time=138.176..138.176 rows=15258 loops=1)
Index Cond: (page_title % 'evil'::pg_catalog.text)
Total runtime: 371.636 ms
(6 rows)
wikidb=> select page_title from page where page_title % 'evil' and
page_namespace=0 limit 10;
page_title
-----------------
Devil
The_Evil_Dead
EV
Evil
Evil_Dead_II
Problem_of_evil
Axis_of_evil
Evil_empire
Evil_clown
Touch_of_Evil
(10 rows)
wikidb=> explain ANALYZE select
page_title,similarity(page_title,'evil') as sim from page where
page_title % 'evil' and page_namespace=0 order by sim desc,page_title
limit 9;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8725.66..8725.68 rows=9 width=21) (actual
time=365.257..365.299 rows=9 loops=1)
-> Sort (cost=8725.66..8731.89 rows=2491 width=21) (actual
time=365.252..365.266 rows=9 loops=1)
Sort Key: (similarity(page_title, 'evil'::pg_catalog.text)), page_title
-> Bitmap Heap Scan on page (cost=109.19..8585.14 rows=2491
width=21) (actual time=142.305..363.793 rows=262 loops=1)
Recheck Cond: (page_namespace = 0)
Filter: (page_title % 'evil'::pg_catalog.text)
-> Bitmap Index Scan on page_page_title_gintgm
(cost=0.00..108.56 rows=2491 width=0) (actual time=138.125..138.125
rows=15258 loops=1)
Index Cond: (page_title % 'evil'::pg_catalog.text)
Total runtime: 365.435 ms
(9 rows)
wikidb=> select page_title,similarity(page_title,'evil') as sim from
page where page_title % 'evil' and page_namespace=0 order by sim
desc,page_title limit 9; page_title | sim
------------+----------
Evil | 1
Evil_E | 0.833333
Dr_Evil | 0.625
Dr._Evil | 0.625
Evil_eye | 0.625
Evil_Eye | 0.625
WP:EVIL | 0.625
Evile | 0.571429
Evil_Ash | 0.555556
(9 rows)
wikidb=> EXPLAIN ANALYZE select
page_title,similarity(page_title,'effencicy') as sim from page where
page_title % 'effencicy' and page_namespace=0 and page_is_redirect=0
order by sim desc,page_title limit 5;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8590.99..8591.00 rows=5 width=21) (actual
time=368.520..368.541 rows=5 loops=1)
-> Sort (cost=8590.99..8591.02 rows=12 width=21) (actual
time=368.514..368.521 rows=5 loops=1)
Sort Key: (similarity(page_title,
'effencicy'::pg_catalog.text)), page_title
-> Bitmap Heap Scan on page (cost=108.57..8590.77 rows=12
width=21) (actual time=343.597..368.475 rows=5 loops=1)
Recheck Cond: (page_namespace = 0)
Filter: ((page_title % 'effencicy'::pg_catalog.text)
AND ((page_is_redirect)::pg_catalog.text = '0'::pg_catalog.text))
-> Bitmap Index Scan on page_page_title_gintgm
(cost=0.00..108.56 rows=2491 width=0) (actual time=340.771..340.771
rows=3640 loops=1)
Index Cond: (page_title % 'effencicy'::pg_catalog.text)
Total runtime: 368.614 ms
(9 rows)
wikidb=> select page_title,similarity(page_title,'effencicy') as sim
from page where page_title % 'effencicy' and page_namespace=0 and
page_is_redirect=0 order by sim desc,page_title limit 5;
page_title | sim
----------------+----------
Effendi | 0.384615
Recency_effect | 0.315789
Effeminacy | 0.3125
Efficiency | 0.3125
Effect | 0.307692
(5 rows)
:)