Hi,
is there a comprehensive, reliable, more profound description of the logical steps the internal search engine (or parser before the engine) undertakes to define:
- what is recognized as a single word in an entered search string (blanks - OK, but what about slash, back slash, hyphen, period?) ?
- what are "similar words" (closeness of words) ?
Different sources (www.mediawiki.org, xy.wikipedia.org/wiki/Help:Search, ...) tell more or less and then different things too. :-(
Thanks to everyone who helps.
On 1/8/09 7:47 AM, Uwe Baumbach wrote:
> Hi,
> is there a comprehensive, reliable, more profound description of the logical steps the internal search engine (or parser before the engine) undertakes to define:
> - what is recognized as a single word in an entered search string
> (blanks - OK, but what about slash, back slash, hyphen, period?) ?
Check MySQL's documentation; also try diving through SearchMySQL.php to check how it's breaking up the input when rendering its output. Also check Language.php for the horrid search tweaking code.
> - what are "similar words" (closeness of words) ?
No such metric exists afaik.
> Different sources (www.mediawiki.org, xy.wikipedia.org/wiki/Help:Search, ...) tell more or less and then different things too.
Note that Wikimedia's sites use a different search engine (MWSearch extension plus our Lucene-based backend), so descriptions of their behavior would not necessarily be what you want if you're looking for descriptions of the default MySQL backend. Note also that the PostgreSQL backend is different.
-- brion
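For the "what counts as a single word" question on the default MySQL backend, here is a rough sketch of the rule MySQL documents for its built-in full-text parser: a word is a run of letters, digits and underscores, optionally joined by single apostrophes, and words shorter than ft_min_word_len (default 4) are not indexed. The function name split_words is made up for illustration. The ASCII-only character class is a simplification, stopwords are ignored, and none of the rewriting MediaWiki does in SearchMySQL.php or Language.php before the text reaches MySQL is modelled here.

```python
import re

# Assumption: approximates MySQL's built-in full-text parser, not MediaWiki's
# own pre-processing. Word characters are letters, digits and underscore;
# a single apostrophe may join two runs of word characters.
WORD_RE = re.compile(r"[A-Za-z0-9_]+(?:'[A-Za-z0-9_]+)*")

# MySQL's documented default for ft_min_word_len.
MIN_WORD_LEN = 4


def split_words(text, min_len=MIN_WORD_LEN):
    """Return the tokens this approximation would treat as indexable words."""
    return [w for w in WORD_RE.findall(text) if len(w) >= min_len]


if __name__ == "__main__":
    # Slash, backslash, hyphen and period all act as separators here.
    print(split_words("client/server back\\slash well-known example.com"))
    # ['client', 'server', 'back', 'slash', 'well', 'known', 'example']
```

Under this approximation slash, backslash, hyphen and period all split words, and "com" is dropped for being shorter than four characters. In MySQL's boolean mode, characters such as + and - are additionally interpreted as operators, which this sketch does not cover.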
I presume by full text search you mean the Lucene search engine, which uses the Vector Space Model? If you know a bit about Lucene, you wouldn't be surprised by what they've done.
If you log the output of lsearchd, you can see how it blows up queries:
original query: prefrontal cortex hippocampus
query=[prefrontal cortex hippocampus] parsed=[(+(contents:prefrontal contents:prefront^0.5) +contents:cortex +contents:hippocampus) ((+title:prefrontal^6.0 +title:cortex^6.0 +title:hippocampus^6.0) (+(stemtitle:prefrontal^2.0 stemtitle:prefront^0.8) +stemtitle:cortex^2.0 +stemtitle:hippocampus^2.0)) ((+alttitle1:prefrontal^4.0 +alttitle1:cortex^4.0 +alttitle1:hippocampus^4.0) (+alttitle2:prefrontal^4.0 +alttitle2:cortex^4.0 +alttitle2:hippocampus^4.0) (+alttitle3:prefrontal^4.0 +alttitle3:cortex^4.0 +alttitle3:hippocampus^4.0))]
Compare that to what PubMed does for the same query:
("prefrontal cortex"[MeSH Terms] OR ("prefrontal"[All Fields] AND "cortex"[All Fields]) OR "prefrontal cortex"[All Fields]) AND ("hippocampus"[MeSH Terms] OR "hippocampus"[All Fields])
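To make the shape of the lsearchd expansion easier to read, here is a small sketch that rebuilds the parsed string from the log line above. It is not lsearchd's code. The field names and boost values are copied from that one log line, the function name expand_query is made up, and the stems argument is a hypothetical stand-in for whatever stemmer lsearchd actually uses (in the log only "prefrontal" shows a distinct stemmed form).

```python
def expand_query(terms, stems):
    """Rebuild the expanded query string seen in the lsearchd log.

    terms: list of query words, in order.
    stems: hypothetical mapping word -> stemmed form; words without an
           entry (or whose stem equals the word) get no stemmed variant.
    """

    def with_stem(field, term, boost, stem_boost):
        # Each term is required (+). If a distinct stem exists, the term
        # and its stem are grouped as alternatives inside one required clause.
        stem = stems.get(term)
        if stem and stem != term:
            return f"+({field}:{term}{boost} {field}:{stem}{stem_boost})"
        return f"+{field}:{term}{boost}"

    # Body text: no boost on the word itself, 0.5 on the stemmed variant.
    contents = "(" + " ".join(with_stem("contents", t, "", "^0.5") for t in terms) + ")"
    # Exact title matches carry the highest boost in the log (6.0).
    title = "(" + " ".join(f"+title:{t}^6.0" for t in terms) + ")"
    # Stemmed title field: 2.0 for the word, 0.8 for its stem.
    stemtitle = "(" + " ".join(with_stem("stemtitle", t, "^2.0", "^0.8") for t in terms) + ")"
    # Three alternative-title fields, each boosted 4.0.
    alt = " ".join(
        "(" + " ".join(f"+alttitle{i}:{t}^4.0" for t in terms) + ")"
        for i in (1, 2, 3)
    )
    return f"{contents} ({title} {stemtitle}) ({alt})"


if __name__ == "__main__":
    print(expand_query(["prefrontal", "cortex", "hippocampus"],
                       {"prefrontal": "prefront"}))
```

Read this way, every word must match in whichever field is searched, and a hit in the title counts far more than a hit in the body, while stemmed variants count less than the word as typed.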
On Thu, Jan 8, 2009 at 11:22 AM, Brion Vibber brion@wikimedia.org wrote:
[...]
MediaWiki-l mailing list MediaWiki-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-l