On 4/7/07, Domas Mituzas <midom.lists(a)gmail.com> wrote:
> Was that complete enwiki set?
Yes.
> I'd expect that was not an unloaded box with all resources free, so the
> 400 ms figure is not that bad for the main search. The task isn't the
> lightest either.
[gmaxwell@floodlamp ~]$ uptime
17:38:31 up 56 days, 23:28, 10 users, load average: 6.03, 6.01, 6.08
So... not unloaded. Still, it would be nicer if it were 30 ms, like the
category intersections on this box. The reality is that trigram indices
are huge: there are about 100 million entries in the page_title
trigram index I have ((mean number of characters per title + 2) *
number of titles).
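For reference, pg_trgm's trigram extraction and similarity measure can be approximated in a few lines of Python. This is a sketch of the module's documented behavior (words are lowercased, split on non-alphanumerics, and padded with two leading spaces and one trailing space; similarity is shared trigrams over the union); locale and non-ASCII handling are glossed over, which is presumably behind oddities like the evilä match below.

```python
import re

def trigrams(text):
    """Approximate pg_trgm's trigram set: lowercase, split into words,
    pad each word with two leading spaces and one trailing space."""
    grams = set()
    for word in re.findall(r'[a-z0-9]+', text.lower()):
        padded = '  ' + word + ' '
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a, b):
    """pg_trgm-style similarity: shared trigrams / union of trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(round(similarity('evils', 'evil'), 6))     # 0.571429, as in the results below
print(round(similarity('vs._evil', 'evil'), 3))  # 0.625, as in the results below
```

This also shows why the index is so large: each word of length n contributes roughly n + 1 trigrams once padded, consistent with the (mean length + 2) estimate above.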
Some of our titles are stupidly long... [[List of things that are far
longer than you expected to be in a Wikipedia article title]] ...
resulting in an utterly huge trigram index. To cut down on this
problem, I've tokenized the titles and created the trigram index over
the unique words.
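The setup behind this looks roughly as follows; the table and index names match the query plans below, but the exact DDL is my reconstruction (in current PostgreSQL syntax, assuming the pg_trgm contrib module is loaded and a page table with a page_title column):

```sql
CREATE TABLE title_words (word text PRIMARY KEY);

-- split titles on non-alphanumerics, keep the distinct lowercased words
INSERT INTO title_words
  SELECT DISTINCT lower(w)
  FROM page, regexp_split_to_table(page.page_title, '[^[:alnum:]]+') AS w
  WHERE w <> '';

-- trigram index over the unique words, not the full titles
CREATE INDEX title_words_trgm ON title_words
  USING gist (word gist_trgm_ops);
```

Indexing unique words instead of whole titles shrinks the trigram index drastically, since long titles are mostly made of short, heavily repeated words.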
It's now quite fast:
wikidb=> explain ANALYZE select word, similarity(word,'evil') as sim
         from title_words where word % 'evil' order by sim desc;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1317.06..1318.29 rows=492 width=12) (actual time=16.131..16.256 rows=75 loops=1)
   Sort Key: (similarity(word, 'evil'::pg_catalog.text))
   ->  Bitmap Heap Scan on title_words  (cost=40.38..1295.07 rows=492 width=12) (actual time=12.038..15.945 rows=75 loops=1)
         Filter: (word % 'evil'::pg_catalog.text)
         ->  Bitmap Index Scan on title_words_trgm  (cost=0.00..40.26 rows=492 width=0) (actual time=11.913..11.913 rows=1334 loops=1)
               Index Cond: (word % 'evil'::pg_catalog.text)
 Total runtime: 16.435 ms
wikidb=> select word,similarity(word,'evil') as sim from title_words
where word % 'evil' order by sim desc limit 10;
word | sim
-----------+----------
evil | 1
evilä | 1
vs._evil | 0.625
evils | 0.571429
evil-lyn | 0.555556
evilla | 0.5
evi | 0.5
live-evil | 0.5
eville | 0.5
evilwm | 0.5
(10 rows)
This can easily be used to create a quite fast article suggestion engine:
wikidb=> explain ANALYZE select word,array((select
'[['||page_title||']]' from article_titles where vector @@
to_tsquery('simple',word) and page_is_redirect=0 order by
length(page_title) limit 5)) as pages,similarity(word,'efficiancy') as
sim from title_words where word % 'efficiancy' order by
dmetaphone(word)=dmetaphone('efficiancy') desc, sim desc;
                                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=4190095.55..4190096.78 rows=492 width=12) (actual time=47.521..47.540 rows=11 loops=1)
   Sort Key: ((dmetaphone(title_words.word) = 'AFSN'::pg_catalog.text)), (similarity(title_words.word, 'efficiancy'::pg_catalog.text))
   ->  Bitmap Heap Scan on title_words  (cost=40.38..4190073.55 rows=492 width=12) (actual time=42.909..47.477 rows=11 loops=1)
         Filter: (word % 'efficiancy'::pg_catalog.text)
         ->  Bitmap Index Scan on title_words_trgm  (cost=0.00..40.26 rows=492 width=0) (actual time=42.713..42.713 rows=355 loops=1)
               Index Cond: (word % 'efficiancy'::pg_catalog.text)
   SubPlan
     ->  Limit  (cost=8513.76..8513.77 rows=5 width=21) (actual time=0.254..0.266 rows=3 loops=11)
           ->  Sort  (cost=8513.76..8513.79 rows=12 width=21) (actual time=0.250..0.254 rows=3 loops=11)
                 Sort Key: (length(article_titles.page_title))
                 ->  Bitmap Heap Scan on article_titles  (cost=99.34..8513.54 rows=12 width=21) (actual time=0.168..0.212 rows=7 loops=11)
                       Recheck Cond: (vector @@ to_tsquery('simple'::pg_catalog.text, $0))
                       Filter: ((page_is_redirect)::pg_catalog.text = '0'::pg_catalog.text)
                       ->  Bitmap Index Scan on article_titles_unstemmedwords_gin  (cost=0.00..99.33 rows=2439 width=0) (actual time=0.158..0.158 rows=15 loops=11)
                             Index Cond: (vector @@ to_tsquery('simple'::pg_catalog.text, $0))
 Total runtime: 47.632 ms
wikidb=> select word,array((select '[['||page_title||']]' from
article_titles where vector @@ to_tsquery('simple',word) and
page_is_redirect=0 order by length(page_title) limit 5)) as
pages,similarity(word,'efficiancy') as sim from title_words where word
% 'efficiancy' order by dmetaphone(word)=dmetaphone('efficiancy')
desc, sim desc;
      word      |                                                          pages                                                            |   sim
----------------+---------------------------------------------------------------------------------------------------------------------------+----------
 efficiency     | {[[Efficiency]],[[X-efficiency]],[[Eco-efficiency]],[[Fuel_efficiency]],[[Atom_efficiency]]}                             | 0.571429
 efficient      | {[[Efficient_Cause]],[[Efficient_breach]],[[Efficient_rationing]],[[Fuel_efficient_driving]],[[Efficient_Market_Canada]]} | 0.4
 efficeon       | {[[Efficeon]]}                                                                                                            | 0.333333
 x-efficiency   | {[[X-efficiency]]}                                                                                                        | 0.5
 eco-efficiency | {[[Eco-efficiency]]}                                                                                                      | 0.470588
 efficacy       | {[[Efficacy]],[[Self-efficacy]],[[Luminous_efficacy]],[[Political_efficacy]],[[Drug_Efficacy_Study_Implementation]]}     | 0.428571
 effi           | {[[Effi_Briest]]}                                                                                                         | 0.333333
 inefficiency   | {[[Inefficiency]],[[X-inefficiency]],[[Coefficient_of_Inefficiency]]}                                                     | 0.333333
 self-efficacy  | {[[Self-efficacy]]}                                                                                                       | 0.315789
 effie          | {[[EFFIE]],[[Effie]],[[Effie_Gray]],[[Effie_Crockett]],[[Effie_Bancroft]]}                                                | 0.307692
 x-inefficiency | {[[X-inefficiency]]}                                                                                                      | 0.3
(11 rows)
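The ordering in that query is worth a note: dmetaphone() comes from PostgreSQL's fuzzystrmatch contrib module and reduces a word to a short phonetic key ('AFSN' for 'efficiancy', per the plan above). Sorting on the boolean key-equality first floats sound-alike words above merely trigram-similar ones, which is why 'efficient' (sim 0.4) and 'efficeon' (sim 0.33) outrank 'x-efficiency' (sim 0.5) in the results. In isolation:

```sql
-- boolean sort keys: TRUE sorts before FALSE under DESC, so words whose
-- phonetic key matches the query's come first; trigram similarity breaks ties
SELECT word, similarity(word, 'efficiancy') AS sim
FROM title_words
WHERE word % 'efficiancy'
ORDER BY dmetaphone(word) = dmetaphone('efficiancy') DESC, sim DESC;
```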
Testing with randomly misspelled words (taken from the titles of
misspelling redirects), I'm hitting the database with 100 lookups per
second right now and it seems to be holding up just fine. I don't want
to bother with real load testing until my other jobs are done.
Plus it gives useful results where our search gives nothing, for example:
Searching 'Nordkerchen'...
wikidb=> select word,array((select '[['||page_title||']]' from
article_titles where vector @@ to_tsquery('simple',word) and
page_is_redirect=0 order by length(page_title) limit 5)) as
pages,similarity(word,'Nordkerchen') as sim from title_words where
word % 'Nordkerchen' order by
dmetaphone(word)=dmetaphone('Nordkerchen') desc, sim desc;
     word      |                                    pages                                    |   sim
---------------+-----------------------------------------------------------------------------+----------
 nordkirchen   | {[[Nordkirchen]],[[Transmitter_Nordkirchen]]}                               | 0.6
 nordkehdingen | {[[Nordkehdingen]]}                                                         | 0.368421
 nordkapp      | {[[Nordkapp]],[[Nordkapp_class_OPV]]}                                       | 0.3125
 nordgren      | {[[Niklas_Nordgren]]}                                                       | 0.3125
 nordkinn      | {[[Cape_Nordkinn]]}                                                         | 0.3125
 nordkisa      | {[[Nordkisa]]}                                                              | 0.3125
 kerchen       | {[[Dora_Noemí_Kerchen]]}                                                    | 0.428571
 norden        | {[[Norden]],[[Van_Norden]],[[Norden_C.C.]],[[Norden_F.C.]],[[Carl_Norden]]} | 0.357143
 nordeen       | {[[William_Nordeen]]}                                                       | 0.333333
 northen       | {[[Rob_Northen_copylock]],[[Great_Northen_and_Southern_Railway]]}           | 0.333333
 scherchen     | {[[Hermann_Scherchen]]}                                                     | 0.3125
 nord          | {[[Nord]],[[NORD-1]],[[NORD-10]],[[LGV_Nord]],[[Nord_262]]}                 | 0.307692
(12 rows)
> Wanna make public search service? :)
Sure, can someone point me to a Python (because it's my preferred
quick-hack programming language) CGI that implements an incremental
AJAX search?
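For what it's worth, the CGI side needs very little: parse q from the query string, run the lookup, and return JSON for the page's JavaScript to render. A minimal sketch follows; suggest_json, handle_request, and the stub lookup callable are all hypothetical names, and the actual database query is left out.

```python
import json
import os
import urllib.parse

def suggest_json(q, lookup):
    """Render up to ten ranked suggestions for query q as a JSON payload.
    `lookup` is any callable returning ranked candidate titles; the real
    service would run the trigram query against title_words here."""
    return json.dumps({'query': q, 'suggestions': list(lookup(q))[:10]})

def handle_request(lookup):
    """Minimal CGI entry point: read ?q=... from QUERY_STRING, emit JSON."""
    params = urllib.parse.parse_qs(os.environ.get('QUERY_STRING', ''))
    q = params.get('q', [''])[0]
    print('Content-Type: application/json\r\n')
    print(suggest_json(q, lookup))

if __name__ == '__main__':
    handle_request(lambda q: [])  # stub lookup; wire the database in here
```

The incremental part is then just client-side JavaScript re-requesting this endpoint on each keystroke.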
I'm web-dumb, but I've pretty much got a good backend done for fast
incremental title suggestions. I'm currently importing Jeff's
'article title thesaurus' data...
I already have an old copy of internal PageRank for enwikipedia
someplace, and I'd love to get the relative article hit rates to use
as additional ranking metrics.