On Wed, Jan 15, 2014 at 12:07 AM, Vitaliy Filippov <vitalif(a)yourcmc.ru> wrote:
>> SearchEngine subclasses can implement getTextFromContent() if they
>> want to override the normal text fetching behavior.
>
> I can't put it into SearchEngine subclass because Tika isn't a search
> engine, it's rather a java application that runs separately and extracts
> text from binary files like *.doc, *.pdf and so on.
>
> TikaMW is a plugin that should work with any search engine - it just
> modifies indexed text for pages in File: namespace.

The problem is you can't make that assumption. Different search indexes
treat text in different ways, and munging them into the same content field
won't allow them to do the right thing. With lsearchd and Elasticsearch, we
absolutely wouldn't want to munge file text into page content (with
sql-backed search, you might maybe).
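
To make the field distinction concrete, here is a rough sketch in Python. It is not MediaWiki code: the function names and the "file_text" field are invented for illustration, and a real backend would define its own document layout.

```python
# Hypothetical sketch, not MediaWiki code: all names here are invented.

def build_sql_row(page_text, file_text):
    """Single-column backend: extracted file text can only be appended."""
    if file_text:
        return {"text": page_text + "\n" + file_text}
    return {"text": page_text}


def build_structured_doc(page_text, file_text):
    """Field-aware backend: keep extracted file text in its own field,
    so the index can analyze and weight it separately from page content."""
    doc = {"text": page_text}
    if file_text:
        doc["file_text"] = file_text
    return doc


if __name__ == "__main__":
    page = "Description of the uploaded report."
    extracted = "Quarterly figures extracted from the PDF."
    print(build_sql_row(page, extracted))
    print(build_structured_doc(page, extracted))
```

A hook that only hands back one modified text string forces every backend into the first shape, which is the assumption I'm objecting to.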

Most of the code in the SearchEngine and related classes is infrastructure
for the sql-backed options, leaving MWSearch and CirrusSearch to reinvent a
lot of wheels. If we cleaned this up a bunch (I want to do this anyway, but
time) we might be able to add a hook back in that only affects search
engines that are implementing the core sql search...

-Chad