On Thu, Mar 7, 2013 at 11:45 AM, Daniel Kinzler <daniel@brightbyte.de> wrote:
> 1) create a specialized XML dump that contains the text generated by
> getTextForSearchIndex() instead of actual page content.
That probably makes the most sense; alternately, make a dump that
includes both "raw" data and "text for search". This also allows for
indexing extra stuff for files -- such as extracted text from a PDF or
DjVu, or metadata from a JPEG -- if the dump process etc. can produce
appropriate indexable data.
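
For illustration only, here's a minimal sketch of how an indexer might
consume such a dual-field dump entry, using StAX. The <searchtext>
element and the sample data are hypothetical -- there's no agreed-on
schema for this yet:

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import java.io.StringReader;

    public class DualDumpSketch {
        // Hypothetical entry carrying both the raw content and the
        // getTextForSearchIndex() output side by side.
        static final String SAMPLE =
            "<page><title>Q42</title>"
            + "<text>{\"label\":{\"en\":\"Douglas Adams\"}}</text>"
            + "<searchtext>Douglas Adams English writer</searchtext>"
            + "</page>";

        public static void main(String[] args) throws Exception {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(SAMPLE));
            String element = null;
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        element = r.getLocalName();
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        String text = r.getText().trim();
                        if (element != null && !text.isEmpty()) {
                            // <text> would be stored as-is; <searchtext>
                            // would be analyzed and fed to the index.
                            System.out.println(element + " -> " + text);
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        element = null;
                        break;
                }
            }
        }
    }

The point is just that the raw content and the pre-rendered search
text travel together, so the indexer never needs to understand the
content model itself.
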
> However, that only works if the dump is created using the PHP dumper.
> How are the regular dumps currently generated on WMF infrastructure?
> Also, would it be feasible to make an extra dump just for LuceneSearch
> (at least for wikidata.org)?
The dumps are indeed created via MediaWiki. I think Ariel or someone
can comment with more detail on how it currently runs; it's been a
while since I was in the thick of it.
> 2) We could re-implement the ContentHandler facility in Java, and
> require extensions that define their own content types to provide a
> Java-based handler in addition to the PHP one. That seems like a
> pretty massive undertaking of dubious value. But it would allow
> maximum control over what is indexed, and how.
Nooooo don't do it :)
> 3) The indexer code (without plugins) should not know about Wikibase,
> but it may have hard-coded knowledge about JSON. It could have a
> special indexing mode for JSON, in which the structure is deserialized
> and traversed, and any values are added to the index (while the keys
> used in the structure would be ignored). We may still be indexing
> useless internals from the JSON, but at least there would be a lot
> fewer false negatives.
Indexing structured data could be awesome -- again I think of file
metadata as well as wikidata-style stuff. But I'm not sure how easy
that'll be. It should probably be in addition to the text indexing,
rather than replacing it.
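
If it helps to make that concrete, here's a rough sketch of the
value-only JSON traversal, assuming a Jackson-style tree API (the
class name and sample data are illustrative):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.ArrayList;
    import java.util.List;

    public class JsonValueExtractor {
        // Walk the deserialized JSON tree, keeping leaf values and
        // dropping the structural keys, as described above.
        static void collect(JsonNode node, List<String> out) {
            if (node.isObject()) {
                // Field names are structure, not content: skip the
                // keys and recurse into the values only.
                node.elements().forEachRemaining(child -> collect(child, out));
            } else if (node.isArray()) {
                for (JsonNode child : node) collect(child, out);
            } else if (node.isValueNode() && !node.isNull()) {
                out.add(node.asText());
            }
        }

        public static void main(String[] args) throws Exception {
            String json = "{\"labels\":{\"en\":\"Douglas Adams\"},"
                        + "\"sitelinks\":[\"enwiki\",\"dewiki\"]}";
            List<String> values = new ArrayList<>();
            collect(new ObjectMapper().readTree(json), values);
            // The joined values become the indexable text blob.
            System.out.println(String.join(" ", values));
        }
    }

We'd still index some noise (numeric IDs, datatype names, and similar
internals), but every human-readable value lands in the index, which
is what cuts down the false negatives Daniel mentions.
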
-- brion