Re: [Wikitech-l] Indexing non-text content in LuceneSearch

9 Mar 2013

      -----Original Message-----
From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Brion Vibber
Sent: Thursday, March 7, 2013 9:59 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] Indexing non-text content in LuceneSearch
On Thu, Mar 7, 2013 at 11:45 AM, Daniel Kinzler daniel@brightbyte.de wrote:
...

create a specialized XML dump that contains the text generated by getTextForSearchIndex() instead of actual page content.
That probably makes the most sense; alternatively, make a dump that includes both "raw" data and "text for search". This also allows for indexing extra stuff for files -- such as extracted text from a PDF or DjVu, or metadata from a JPEG -- if the dump process etc. can produce appropriate indexable data.
...
However, that only works if the dump is created using the PHP dumper. How are the regular dumps currently generated on WMF infrastructure? Also, would it be feasible to make an extra dump just for LuceneSearch (at least for wikidata.org)?
The dumps are indeed created via MediaWiki. I think Ariel or someone can comment with more detail on how it currently runs; it's been a while since I was in the thick of it.
...

We could re-implement the ContentHandler facility in Java, and require extensions that define their own content types to provide a Java-based handler in addition to the PHP one. That seems like a pretty massive undertaking of dubious value. But it would allow maximum control over what is indexed how.
Nooooo don't do it :)
...

The indexer code (without plugins) should not know about Wikibase, but it may have hard-coded knowledge about JSON. It could have a special indexing mode for JSON, in which the structure is deserialized and traversed, and any values are added to the index (while the keys used in the structure would be ignored). We may still be indexing useless internals from the JSON, but at least there would be a lot fewer false negatives.
Indexing structured data could be awesome -- again I think of file metadata as well as wikidata-style stuff. But I'm not sure how easy that'll be. Should probably be in addition to the text indexing, rather than replacing.
-- brion
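A minimal sketch of the JSON indexing mode described above: deserialize the structure, walk it, keep the values and ignore the keys. The function names and the sample document are made up for illustration (the sample is not the real Wikibase serialization), and skipping booleans and nulls is my own assumption about what is worth indexing.

```python
import json
from typing import Any, Iterator


def iter_json_values(node: Any) -> Iterator[str]:
    """Yield every scalar value in a decoded JSON structure, ignoring keys."""
    if isinstance(node, dict):
        # Keys are structural, so only descend into the values.
        for value in node.values():
            yield from iter_json_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_json_values(item)
    elif isinstance(node, bool) or node is None:
        # Assumption: true/false/null carry nothing worth searching for.
        return
    else:
        yield str(node)


def text_for_index(raw_json: str) -> str:
    """Flatten a JSON blob into one whitespace-separated string for indexing."""
    return " ".join(iter_json_values(json.loads(raw_json)))


# Hypothetical sample document.
sample = (
    '{"label": {"en": "Berlin", "de": "Berlin"},'
    ' "aliases": ["Berlin, Germany"],'
    ' "claims": [{"p": 1082, "value": 3500000}]}'
)
print(text_for_index(sample))
```

Note that numeric IDs such as the 1082 above still end up in the flattened text, which is the "useless internals" trade-off mentioned in the quoted proposal.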
I agree with Brion.
Here are my 5 shekels' worth.
To index non-MWDump data with LuceneSearch I would:
1. Modify the daemon to read the custom dump format, or update the XML dump to support a JSON dump.
2. (The daemon currently uses the MWDumper codebase to do this.)
3. Add a Lucene analyzer to handle the new data type, say a JSON analyzer.
4. Add a Lucene document per JSON-based Wikidata schema.
5. Update the query parser to handle the new queries and the modified Lucene documents.
6. For bonus points, modify spelling correction and write a Wikidata ranking algorithm.
But this would only solve reading the static dumps used to bootstrap the index. I would then have to change how MWSearch periodically polls Brion's OAIRepository to pull in updated pages.
I've been coding some analytics on MWDumps from WMF and Wikia wikis for a research project, and I can say this:
1. Most big dumps (e.g. the historic ones) inherit the issues of wikitext, namely unescaped tags and entities, which crash modern Java XML libraries. So escape your data and validate the XML!
2. The good old SAX code in MWDumper still works fine, so use it.
3. Use Lucene 2.4 with the deprecated old APIs.
4. Ariel is doing a great job (e.g. the 7z compression and the splitting of the dumps), but these are things MWDumper does not handle yet.
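To illustrate the first point, here is a small sketch of escaping page text before it goes into an XML element and then checking well-formedness with a SAX parser. It uses Python's standard library rather than the Java SAX code in MWDumper, and the `<text>` wrapper and the helper names are placeholders of mine, not the actual dump schema.

```python
import xml.sax
from xml.sax.saxutils import escape


def is_well_formed(xml_text: str) -> bool:
    """Return True if a SAX parser can read the document without errors."""
    try:
        # A bare ContentHandler ignores all events; we only care about errors.
        xml.sax.parseString(xml_text.encode("utf-8"), xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


def wrap_page_text(wikitext: str) -> str:
    """Escape &, < and > in wikitext and wrap it in a placeholder element."""
    return "<text>" + escape(wikitext) + "</text>"


raw = "Use <ref> & <br> freely"
broken = "<text>" + raw + "</text>"  # unescaped tags and a bare ampersand
fixed = wrap_page_text(raw)

print(is_well_formed(broken))
print(is_well_formed(fixed))
print(fixed)
```

Running the check over each page as it is written costs little compared with a parser crash halfway through a multi-gigabyte history dump.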
Finally, based on my work with the i18n team on TranslateWiki search, I can say that indexing JSON data with Solr + Solarium requires no search engine coding at all.
You define the document schema and use Solarium to push JSON and get results too. I could do a demo of how to do this at a coming hackathon if there is any interest; however, when I offered to replace LuceneSearch like this last October the idea was rejected out of hand.
-- oren
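A rough sketch of what "push JSON" amounts to on the wire, assuming a Solr instance whose /update handler accepts a JSON array of documents. This is plain HTTP from Python, not Solarium (which is a PHP client), and the host, the core name "wikidata" and the field names are all hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request


def build_solr_update(base_url: str, core: str, docs: list) -> urllib.request.Request:
    """Build (but do not send) a POST that adds JSON documents to a Solr core."""
    url = f"{base_url}/solr/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; the fields must match whatever schema you defined.
docs = [
    {"id": "Q64", "label_en": "Berlin", "description_en": "capital of Germany"},
]
request = build_solr_update("http://localhost:8983", "wikidata", docs)
print(request.get_method(), request.full_url)

# To actually send it against a running Solr:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```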
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
