On Thu, Mar 7, 2013 at 11:45 AM, Daniel Kinzler <daniel@brightbyte.de> wrote:
> 1) create a specialized XML dump that contains the text generated by
> getTextForSearchIndex() instead of actual page content.
That probably makes the most sense; alternately, make a dump that
includes both "raw" data and "text for search". This also allows for
indexing extra stuff for files -- such as extracted text from a PDF or
DjVu, or metadata from a JPEG -- if the dump process etc. can produce
appropriate indexable data.
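
For illustration only, here's a minimal sketch of how an indexer might
consume such a dual-field dump entry, using StAX. The <searchtext>
element and the sample data are hypothetical -- there's no agreed-on
schema for this yet:

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import java.io.StringReader;

    public class DualDumpSketch {
        // Hypothetical entry carrying both the raw content and the
        // getTextForSearchIndex() output side by side.
        static final String SAMPLE =
            "<page><title>Q42</title>"
            + "<text>{\"label\":{\"en\":\"Douglas Adams\"}}</text>"
            + "<searchtext>Douglas Adams English writer</searchtext>"
            + "</page>";

        public static void main(String[] args) throws Exception {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(SAMPLE));
            String element = null;
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        element = r.getLocalName();
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        String text = r.getText().trim();
                        if (element != null && !text.isEmpty()) {
                            // <text> would be stored as-is; <searchtext>
                            // would be analyzed and fed to the index.
                            System.out.println(element + " -> " + text);
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        element = null;
                        break;
                }
            }
        }
    }

The point is just that the raw content and the pre-rendered search
text travel together, so the indexer never needs to understand the
content model itself.
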
> However, that only works if the dump is created using the PHP dumper.
> How are the regular dumps currently generated on WMF infrastructure?
> Also, would it be feasible to make an extra dump just for LuceneSearch
> (at least for wikidata.org)?
The dumps are indeed created via MediaWiki. I think Ariel or someone
can comment with more detail on how it currently runs; it's been a
while since I was in the thick of it.
> 2) We could re-implement the ContentHandler facility in Java, and
> require extensions that define their own content types to provide a
> Java-based handler in addition to the PHP one. That seems like a
> pretty massive undertaking of dubious value. But it would allow
> maximum control over what is indexed, and how.
Nooooo don't do it :)
> 3) The indexer code (without plugins) should not know about Wikibase,
> but it may have hard-coded knowledge about JSON. It could have a
> special indexing mode for JSON, in which the structure is deserialized
> and traversed, and any values are added to the index (while the keys
> used in the structure would be ignored). We may still be indexing
> useless internals from the JSON, but at least there would be a lot
> fewer false negatives.
Indexing structured data could be awesome -- again I think of file
metadata as well as wikidata-style stuff. But I'm not sure how easy
that'll be. It should probably be in addition to the text indexing,
rather than replacing it.
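
If it helps to make that concrete, here's a rough sketch of the
value-only JSON traversal, assuming a Jackson-style tree API (the
class name and sample data are illustrative):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.ArrayList;
    import java.util.List;

    public class JsonValueExtractor {
        // Walk the deserialized JSON tree, keeping leaf values and
        // dropping the structural keys, as described above.
        static void collect(JsonNode node, List<String> out) {
            if (node.isObject()) {
                // Field names are structure, not content: skip the
                // keys and recurse into the values only.
                node.elements().forEachRemaining(child -> collect(child, out));
            } else if (node.isArray()) {
                for (JsonNode child : node) collect(child, out);
            } else if (node.isValueNode() && !node.isNull()) {
                out.add(node.asText());
            }
        }

        public static void main(String[] args) throws Exception {
            String json = "{\"labels\":{\"en\":\"Douglas Adams\"},"
                        + "\"sitelinks\":[\"enwiki\",\"dewiki\"]}";
            List<String> values = new ArrayList<>();
            collect(new ObjectMapper().readTree(json), values);
            // The joined values become the indexable text blob.
            System.out.println(String.join(" ", values));
        }
    }

We'd still index some noise (numeric IDs, datatype names, and similar
internals), but every human-readable value lands in the index, which
is what cuts down the false negatives Daniel mentions.
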
-- brion