On 10/25/21 2:22 AM, David Causse wrote:
...
On Mon, Oct 25, 2021 at 6:37 AM Matto Marjanovic
<maddog(a)mir.com> wrote:
...
It would be a-ok if the 'more_file_text' could just be treated as additional
content for the 'file_text' field. (However, simply populating the existing
'file_text' field via the SearchDataForIndexHook does not work, because the
FileContentHandler::getDataForSearchIndex() method runs after the hook and
always forcefully overwrites the 'file_text' field.)
This should be doable by implementing the CirrusSearchBuildDocumentParse hook, which runs
very late in the process (see the Cirrus docs under docs/hooks.txt).
You could use CirrusSearchBuildDocumentParse alone if you have the data at hand when this
hook runs, or combine the two: SearchDataForIndexHook to populate a
"more_file_text" field as you do now, plus CirrusSearchBuildDocumentParse to append
this "more_file_text" to the existing "file_text" and possibly empty
the "more_file_text" field if you no longer need it.
I guess I should point out that I am working with MediaWiki 1.35 (and beyond)...
Alas, it seems that between 1.31 and 1.35, the CirrusSearchBuildDocumentParse
hook was removed, and then reinstated very *early* in the process. It is now
run in BuildDocument::initialize(), even before the SearchDataForIndexHook.
So, again, anything it does to the 'file_text' field will just get stomped on
by the FileContentHandler later on. (And, comments in BuildDocument say
"Use of this hook is deprecated ... restoring this hook is a temporary hack
for WikibaseMediaInfo", so I wouldn't want to depend on it moving forward.)
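To make the ordering problem above concrete, here is a toy model of the sequence as described in this message. It is only a sketch: every function below is a hypothetical stand-in written for illustration, not the real MediaWiki or CirrusSearch code, and the field values are made up.

```python
# Toy model of the 1.35 ordering described above. All names are hypothetical
# stand-ins for illustration; none of this is MediaWiki or CirrusSearch code.

def early_build_document_parse_hook(doc):
    # Stand-in for CirrusSearchBuildDocumentParse, which (per the message)
    # now runs early, before the content handler fills in its fields.
    doc["file_text"] = "extra text added by the early hook"


def search_data_for_index_hook(fields):
    # Stand-in for a SearchDataForIndex handler that adds a new field and
    # also tries to populate 'file_text' directly.
    fields["more_file_text"] = "extra text"
    fields["file_text"] = "extra text"


def file_handler_get_data_for_search_index(doc):
    # Stand-in for FileContentHandler::getDataForSearchIndex(): the hook runs
    # partway through, then 'file_text' is overwritten unconditionally.
    fields = dict(doc)
    search_data_for_index_hook(fields)
    fields["file_text"] = "text extracted from the file"
    return fields


def build_document():
    doc = {}
    early_build_document_parse_hook(doc)
    return file_handler_get_data_for_search_index(doc)


result = build_document()
print(result)
```

In this model, 'more_file_text' survives as its own field, but neither hook's change to 'file_text' does, because the handler's assignment comes last.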
There are probably more ways to achieve what you want with greater
control of the ranking, but this will probably be much more involved
(i.e. writing your own search query builder).
If only the SearchDataForIndexHook were properly run late, this would be so
simple....
-mm
ps: There are a bunch of broken things in all this code:
o SearchDataForIndexHook is run by getDataForSearchIndex() in each
ContentHandler, but the design ensures that it is run at some
ambiguous place in the middle of getDataForSearchIndex(), instead
of resolutely at the end.
o SearchIndexFieldsHook *is not* run by getFieldsForSearchIndex() of a
ContentHandler. This means that CirrusSearch\BuildDocument never
sees the definitions for any fields added by the hook. The only code
in CirrusSearch that does see the definitions is MappingConfigBuilder
(maintenance code).
I suppose I should figure out how to file a bug report somewhere.