Need a way to modify text before indexing (was SearchUpdate) - Wikitech-l - lists.wikimedia.org

Need a way to modify text before indexing (was SearchUpdate)

vitalif＠yourcmc.ru

14 Jan 2014

10:33 p.m.

Hi! Change https://gerrit.wikimedia.org/r/#/c/79025/ that was merged to 1.22 breaks my TikaMW extension - I used that hook to extract contents from binary files so the user can then search on it. Maybe you can add some other hook for this purpose? See also https://github.com/mediawiki4intranet/TikaMW/issues/2

Chad

14 Jan 2014

11:48 p.m.

On Tue, Jan 14, 2014 at 2:33 PM, <vitalif(a)yourcmc.ru> wrote:

> Hi! Change https://gerrit.wikimedia.org/r/#/c/79025/ that was merged to 1.22 breaks my TikaMW extension - I used that hook to extract contents from binary files so the user can then search on it. Maybe you can add some other hook for this purpose? See also https://github.com/mediawiki4intranet/TikaMW/issues/2

SearchEngine subclasses can implement getTextFromContent() if they want to override the normal text fetching behavior.

-Chad
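The subclass approach suggested above can be sketched roughly as follows. This is a minimal illustration in Python rather than MediaWiki's PHP, so every class, method, and parameter name here is a hypothetical stand-in (only the idea of overriding the text-fetching method comes from the message), and the extractor callback is an assumed placeholder for something like a Tika call.

```python
class Content:
    """Hypothetical stand-in for a page's content object."""

    def __init__(self, text):
        self.text = text

    def get_text_for_search_index(self):
        return self.text


class SearchEngine:
    """Hypothetical base class with the default text-fetching behavior."""

    def get_text_from_content(self, title, content=None):
        # Default: index whatever text the content object provides.
        return content.get_text_for_search_index() if content else ""


class FileAwareSearchEngine(SearchEngine):
    """Hypothetical subclass that overrides text fetching for File: pages."""

    def __init__(self, extract_file_text):
        # extract_file_text is an assumed callback: title -> extracted text.
        self.extract_file_text = extract_file_text

    def get_text_from_content(self, title, content=None):
        text = super().get_text_from_content(title, content)
        if title.startswith("File:"):
            extracted = self.extract_file_text(title)
            if extracted:
                # Append extracted file text to the page's own text.
                text = f"{text}\n{extracted}" if text else extracted
        return text
```

Note that the override lives inside one particular engine class, so it only takes effect when that engine is the one in use.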

Vitaliy Filippov

15 Jan 2014

8:07 a.m.

> SearchEngine subclasses can implement getTextFromContent() if they want to override the normal text fetching behavior.

I can't put it into SearchEngine subclass because Tika isn't a search engine, it's rather a java application that runs separately and extracts text from binary files like *.doc, *.pdf and so on. TikaMW is a plugin that should work with any search engine - it just modifies indexed text for pages in File: namespace.

--
With best regards,
Vitaliy Filippov

Chad

5:41 p.m.

On Wed, Jan 15, 2014 at 12:07 AM, Vitaliy Filippov <vitalif(a)yourcmc.ru> wrote:

> > SearchEngine subclasses can implement getTextFromContent() if they want to override the normal text fetching behavior.
>
> I can't put it into SearchEngine subclass because Tika isn't a search engine, it's rather a java application that runs separately and extracts text from binary files like *.doc, *.pdf and so on. TikaMW is a plugin that should work with any search engine - it just modifies indexed text for pages in File: namespace.

The problem is you can't make that assumption. Different search indexes treat text in different ways, and munging them into the same content field won't allow them to do the right thing. With lsearchd and Elasticsearch, we absolutely wouldn't want to munge file text into page content (with sql-backed search, you might maybe).

Most of the code in the SearchEngine and related classes is infrastructure for the sql-backed options, leaving MWSearch and CirrusSearch to reinvent a lot of wheels. If we cleaned this up a bunch (I want to do this anyway, but time) we might be able to add a hook back in that only affects search engines that are implementing the core sql search...

-Chad
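The distinction drawn above, between munging file text into page content and keeping it apart, can be sketched as follows. This is only an illustration under stated assumptions: the function names and field names ("text", "file_text") are hypothetical and do not reflect the actual schema of the sql-backed index, lsearchd, or Elasticsearch.

```python
def build_single_field_row(page_text, file_text):
    """Single-text-field index: extracted file text can only be
    concatenated onto the page text (the 'munging' case)."""
    parts = [part for part in (page_text, file_text) if part]
    return {"text": "\n".join(parts)}


def build_structured_doc(page_text, file_text):
    """Multi-field index: file text goes into its own (hypothetical)
    field so the engine can analyze and weight it separately."""
    doc = {"text": page_text}
    if file_text:
        doc["file_text"] = file_text
    return doc


page = "Scan of the 2013 annual report."
extracted = "Revenue grew in the third quarter."

single = build_single_field_row(page, extracted)
structured = build_structured_doc(page, extracted)
```

In the single-field case the origin of each sentence is lost once indexed, whereas the structured document leaves the page text untouched. A generic hook that rewrites the indexed text can only express the first form.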
