Re: [Wikitech-l] Indexing non-text content in LuceneSearch

7 Mar 2013

On Thu, Mar 7, 2013 at 11:45 AM, Daniel Kinzler <daniel@brightbyte.de> wrote:
...

> create a specialized XML dump that contains the text generated by
> getTextForSearchIndex() instead of actual page content.

That probably makes the most sense; alternatively, make a dump that
includes both "raw" data and "text for search". This also allows for
indexing extra stuff for files -- such as extracted text from a PDF or
DjVu or metadata from a JPEG -- if the dump process etc. can produce
appropriate indexable data.
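
To make the "both raw data and text for search" idea concrete, here is a minimal sketch of what one page record in such a dump might look like. It is only an illustration: the `searchtext` element name and the `build_page_element` helper are hypothetical and are not part of the actual MediaWiki export format, and the search text is assumed to have been produced elsewhere (for example by getTextForSearchIndex()).

```python
import xml.etree.ElementTree as ET


def build_page_element(title, raw_content, search_text):
    """Build one <page> record carrying both the raw content and the
    pre-computed search text. Element names are hypothetical."""
    page = ET.Element("page")
    ET.SubElement(page, "title").text = title
    # Raw serialized content, exactly as stored (wikitext, JSON, ...).
    ET.SubElement(page, "text").text = raw_content
    # Plain text meant for the indexer; assumed to be generated upstream.
    ET.SubElement(page, "searchtext").text = search_text
    return page


page = build_page_element("Q64", '{"label": {"en": "Berlin"}}', "Berlin")
xml_out = ET.tostring(page, encoding="unicode")
print(xml_out)
```

An indexer reading such a record could use the search text when it is present and fall back to the raw text otherwise, so ordinary wikitext pages would not need special handling.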
...
> However, that only works if the dump is created using the PHP dumper. How are
> the regular dumps currently generated on WMF infrastructure? Also, would it be
> feasible to make an extra dump just for LuceneSearch (at least for
> wikidata.org)?

The dumps are indeed created via MediaWiki. I think Ariel or someone
can comment with more detail on how it currently runs; it's been a
while since I was in the thick of it.
...

> We could re-implement the ContentHandler facility in Java, and require
> extensions that define their own content types to provide a Java based handler
> in addition to the PHP one. That seems like a pretty massive undertaking of
> dubious value. But it would allow maximum control over what is indexed how.

Nooooo don't do it :)
...

> The indexer code (without plugins) should not know about Wikibase, but it may
> have hard coded knowledge about JSON. It could have a special indexing mode for
> JSON, in which the structure is deserialized and traversed, and any values are
> added to the index (while the keys used in the structure would be ignored). We
> may still be indexing useless internals from the JSON, but at least there would
> be a lot fewer false negatives.

Indexing structured data could be awesome -- again I think of file
metadata as well as wikidata-style stuff. But I'm not sure how easy
that'll be. Should probably be in addition to the text indexing,
rather than replacing.
-- brion
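
The JSON indexing mode quoted above (deserialize, traverse, index the values, ignore the keys) can be sketched roughly as follows. This is an illustration under assumptions, not the LuceneSearch indexer's code: the function names are hypothetical, the sample document is made up and does not follow the real Wikibase serialization, and the real indexer is written in Java rather than Python.

```python
import json


def extract_values(node):
    """Yield every scalar value in a decoded JSON structure.
    Dictionary keys are never yielded, only the values beneath them."""
    if isinstance(node, dict):
        for value in node.values():
            yield from extract_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from extract_values(item)
    elif isinstance(node, str):
        yield node
    elif node is not None and not isinstance(node, bool):
        # Numbers are indexed as their string form; null, true and
        # false are skipped as noise.
        yield str(node)


def json_to_index_text(raw_json):
    """Flatten a JSON document into one string of indexable text."""
    return " ".join(extract_values(json.loads(raw_json)))


# Made-up sample document, not the real Wikibase format.
sample = '{"label": {"en": "Berlin", "de": "Berlin"}, "aliases": ["Spree-Athen"], "id": 64}'
print(json_to_index_text(sample))  # prints: Berlin Berlin Spree-Athen 64
```

As the quoted text notes, a traversal like this still picks up internal values such as the numeric id, but the keys ("label", "aliases", "en") no longer appear in the indexed text.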
