On Thu, Mar 7, 2013 at 11:45 AM, Daniel Kinzler daniel@brightbyte.de wrote:
- create a specialized XML dump that contains the text generated by
getTextForSearchIndex() instead of actual page content.
That probably makes the most sense; alternately, make a dump that includes both "raw" data and "text for search". This also allows for indexing extra stuff for files -- such as extracted text from a PDF or DjVu, or metadata from a JPEG -- if the dump process etc. can produce appropriate indexable data.
However, that only works if the dump is created using the PHP dumper. How are the regular dumps currently generated on WMF infrastructure? Also, would it be feasible to make an extra dump just for LuceneSearch (at least for wikidata.org)?
The dumps are indeed created via MediaWiki. I think Ariel or someone can comment with more detail on how it currently runs; it's been a while since I was in the thick of it.
- We could re-implement the ContentHandler facility in Java, and require
extensions that define their own content types to provide a Java based handler in addition to the PHP one. That seems like a pretty massive undertaking of dubious value. But it would allow maximum control over what is indexed how.
Nooooo don't do it :)
- The indexer code (without plugins) should not know about Wikibase, but it may
have hard-coded knowledge about JSON. It could have a special indexing mode for JSON, in which the structure is deserialized and traversed, and any values are added to the index (while the keys used in the structure would be ignored). We may still be indexing useless internals from the JSON, but at least there would be a lot fewer false negatives.
Indexing structured data could be awesome -- again I think of file metadata as well as wikidata-style stuff. But I'm not sure how easy that'll be. Should probably be in addition to the text indexing, rather than replacing.
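
For illustration only, here is a rough sketch of the "JSON indexing mode" idea: deserialize the blob, walk the structure, keep the values and ignore the keys. It is written in Python for brevity rather than in the indexer's own language, and the function names (extract_values, text_for_index), the sample blob, and the choice to skip booleans and nulls are all assumptions made for the example, not anything taken from the actual indexer or from Wikibase.

```python
import json


def extract_values(node):
    """Yield every scalar leaf value of a decoded JSON structure as text.

    Keys of JSON objects are deliberately ignored; only values are emitted.
    """
    if isinstance(node, dict):
        # Walk the values only; the keys never reach the index.
        for child in node.values():
            yield from extract_values(child)
    elif isinstance(node, list):
        for child in node:
            yield from extract_values(child)
    elif isinstance(node, str):
        if node:
            yield node
    elif isinstance(node, bool) or node is None:
        # Assumption: true/false/null carry no searchable text, so skip them.
        return
    elif isinstance(node, (int, float)):
        yield str(node)


def text_for_index(raw_json):
    """Return one whitespace-joined string of all values in a JSON document."""
    return " ".join(extract_values(json.loads(raw_json)))


if __name__ == "__main__":
    # Hypothetical blob; real Wikibase JSON has a different, richer layout.
    sample = '{"label": {"en": "Berlin"}, "aliases": ["Berlin, Germany"], "capital": true}'
    print(text_for_index(sample))
```

The resulting string could be fed to the text indexer alongside the regular page text, in line with indexing structured data in addition to the text rather than replacing it.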
-- brion