I presume by full text search you mean the Lucene search engine, which uses
the Vector Space Model?
If you know a bit about Lucene, you won't be surprised by what they've
done.
If you log the output of lsearchd, you can see how much it expands each query:
original query: prefrontal cortex hippocampus

query=[prefrontal cortex hippocampus] parsed=[
  (+(contents:prefrontal contents:prefront^0.5)
   +contents:cortex +contents:hippocampus)
  ((+title:prefrontal^6.0 +title:cortex^6.0 +title:hippocampus^6.0)
   (+(stemtitle:prefrontal^2.0 stemtitle:prefront^0.8)
    +stemtitle:cortex^2.0 +stemtitle:hippocampus^2.0))
  ((+alttitle1:prefrontal^4.0 +alttitle1:cortex^4.0 +alttitle1:hippocampus^4.0)
   (+alttitle2:prefrontal^4.0 +alttitle2:cortex^4.0 +alttitle2:hippocampus^4.0)
   (+alttitle3:prefrontal^4.0 +alttitle3:cortex^4.0 +alttitle3:hippocampus^4.0))]
Compare that to what PubMed does for the same query:

("prefrontal cortex"[MeSH Terms]
 OR ("prefrontal"[All Fields] AND "cortex"[All Fields])
 OR "prefrontal cortex"[All Fields])
AND ("hippocampus"[MeSH Terms] OR "hippocampus"[All Fields])

On Thu, Jan 8, 2009 at 11:22 AM, Brion Vibber <brion(a)wikimedia.org> wrote:
> On 1/8/09 7:47 AM, Uwe Baumbach wrote:
>> Hi,
>> is there a comprehensive, reliable, more profound description of the
>> logical steps the internal search engine (or parser before the engine)
>> undertakes to define:
>> - what is recognized as a single word in an entered search string
>> (blanks - OK, but what about slash, back slash, hyphen, period?) ?
>
> Check MySQL's documentation; also try diving through SearchMySQL.php to
> check how it's breaking up the input when rendering its output. Also
> check Language.php for the horrid search tweaking code.
>
>> - what are "similar words" (closeness of words) ?
>
> No such metric exists afaik.
>
>> ...) tell more or less and then different things
>> too.
>
> Note that Wikimedia's sites use a different search engine (MWSearch
> extension plus our Lucene-based backend), so descriptions of their
> behavior would not necessarily be what you want if you're looking for
> descriptions of the default MySQL backend. Note also that the PostgreSQL
> backend is different.
>
> -- brion