Hi!
Change https://gerrit.wikimedia.org/r/#/c/79025/ that was merged to 1.22 breaks my TikaMW extension - I used that hook to extract contents from binary files so the user can then search on it.
Maybe you can add some other hook for this purpose?
See also https://github.com/mediawiki4intranet/TikaMW/issues/2
On Tue, Jan 14, 2014 at 2:33 PM, vitalif@yourcmc.ru wrote:
> Hi!
> Change https://gerrit.wikimedia.org/r/#/c/79025/ that was merged to 1.22 breaks my TikaMW extension - I used that hook to extract contents from binary files so the user can then search on it.
> Maybe you can add some other hook for this purpose?
> See also https://github.com/mediawiki4intranet/TikaMW/issues/2
SearchEngine subclasses can implement getTextFromContent() if they want to override the normal text fetching behavior.
-Chad
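
The override Chad describes can be illustrated with a small language-neutral sketch. It is written in Python rather than MediaWiki's PHP, and everything in it is a stand-in: `Page`, `SearchEngine`, `FileTextSearchEngine`, `get_text_from_content` and the `extract_file_text` callback are hypothetical names that only mirror the idea of a subclass replacing the default text-fetching step. It is not MediaWiki's or TikaMW's actual code.

```python
from dataclasses import dataclass
from typing import Callable

NS_FILE = 6  # MediaWiki's namespace number for File: pages


@dataclass
class Page:
    """Hypothetical stand-in for a wiki page handed to the indexer."""
    namespace: int
    title: str
    wikitext: str


class SearchEngine:
    """Stand-in base class: by default the indexed text is the page's own text."""

    def get_text_from_content(self, page: Page) -> str:
        return page.wikitext


class FileTextSearchEngine(SearchEngine):
    """Subclass that appends extracted file text for pages in the File: namespace."""

    def __init__(self, extract_file_text: Callable[[Page], str]):
        # extract_file_text is an assumed callback, e.g. a wrapper around
        # an external extractor such as Tika.
        self._extract = extract_file_text

    def get_text_from_content(self, page: Page) -> str:
        text = super().get_text_from_content(page)
        if page.namespace != NS_FILE:
            return text
        extracted = self._extract(page)
        return f"{text}\n{extracted}" if extracted else text


if __name__ == "__main__":
    engine = FileTextSearchEngine(lambda page: "text pulled from the PDF")
    print(engine.get_text_from_content(Page(NS_FILE, "File:Report.pdf", "A report")))
```

Note that in this shape the extraction logic lives inside one specific engine class, so it has to be repeated for every engine that should pick it up.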
> SearchEngine subclasses can implement getTextFromContent() if they want to override the normal text fetching behavior.
I can't put it into a SearchEngine subclass because Tika isn't a search engine; it's a Java application that runs separately and extracts text from binary files like *.doc, *.pdf and so on.
TikaMW is a plugin that should work with any search engine - it just modifies the indexed text for pages in the File: namespace.
On Wed, Jan 15, 2014 at 12:07 AM, Vitaliy Filippov vitalif@yourcmc.ru wrote:
>> SearchEngine subclasses can implement getTextFromContent() if they want to override the normal text fetching behavior.
> I can't put it into a SearchEngine subclass because Tika isn't a search engine; it's a Java application that runs separately and extracts text from binary files like *.doc, *.pdf and so on.
> TikaMW is a plugin that should work with any search engine - it just modifies the indexed text for pages in the File: namespace.
The problem is that you can't make that assumption. Different search indexes treat text in different ways, and munging file text and page text into the same content field won't let them do the right thing. With lsearchd and Elasticsearch, we absolutely wouldn't want to munge file text into page content (with SQL-backed search, you might).
Most of the code in the SearchEngine and related classes is infrastructure for the SQL-backed options, leaving MWSearch and CirrusSearch to reinvent a lot of wheels. If we cleaned this up a bunch (I want to do this anyway, but time), we might be able to add a hook back in that only affects search engines that implement the core SQL search...
-Chad
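
The difference between munging file text into page content and keeping it apart can be shown with a small language-neutral sketch in Python. The function names and the `text` / `file_text` field names are assumptions made for illustration; they are not taken from lsearchd, MWSearch, CirrusSearch or the SQL search code.

```python
from typing import Dict


def build_single_field_doc(page_text: str, file_text: str) -> Dict[str, str]:
    """One searchable column: file text can only be concatenated onto the
    page text, so the backend cannot tell the two apart afterwards."""
    return {"text": f"{page_text}\n{file_text}".strip()}


def build_multi_field_doc(page_text: str, file_text: str) -> Dict[str, str]:
    """Separate fields: the backend can analyze, weight or highlight the
    extracted file text independently of the page text."""
    doc = {"text": page_text}
    if file_text:
        doc["file_text"] = file_text  # hypothetical field name
    return doc


if __name__ == "__main__":
    print(build_single_field_doc("Report page", "pdf body"))
    print(build_multi_field_doc("Report page", "pdf body"))
```

A hook that rewrites the page text before indexing can only produce the first shape, which is why it fits a single-column index better than a backend that wants file text in its own field.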
wikitech-l@lists.wikimedia.org