FWIW, we do index the full text of (PDF and?) DjVu files on Commons
(because it's stored in img_metadata). It's probably the biggest
improvement CirrusSearch brought for Commons.
And we also index office documents via Tika (*.doc and similar).
And I think it should not be a feature of the search engine at all! It's
a separate feature that's completely independent of the search engine
used (that's how it's implemented in my TikaMW).
So, is there any replacement for the SearchUpdate hook to modify the
indexed text?
Of course I can just bring SearchUpdate back by including a patch in
our distribution, mediawiki4intranet, but I would prefer if TikaMW didn't
require patching...