On Thu, Oct 29, 2015 at 8:47 AM, Strainu strainu10@gmail.com wrote:
Hi,
I've been reading the mw.org and wikitech pages on CirrusSearch (and the code), hoping to understand how page content is transformed before being sent to ES and how it is stored in ES. I have a few questions:
- Is the documentation available anywhere? I don't see it on
Feature documentation is at https://www.mediawiki.org/wiki/Help:CirrusSearch; operational documentation is at https://wikitech.wikimedia.org/wiki/Search
- What part of the whole ecosystem transforms the wikitext into
indexable text? Where can I find it? It should be somewhere downstream from CirrusSearch\Updater::updateFromTitle(), but I can't figure out where exactly.
The documents are built using the classes in https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/tree/master/i...
If this transformation doesn't happen, from where is the searchable text obtained?
- Where can I find the ES schema used for wikipages? Is it different
for images/categories?
The ES schema is the same everywhere. The easiest way to see what the data looks like is to request a dump for a particular page by adding action=cirrusdump to its URL. This outputs JSON; I use a Chrome extension called JSONView to make it look nice: https://wikitech.wikimedia.org/wiki/Search?action=cirrusdump
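For anyone who would rather script this than use a browser, here is a minimal Python sketch of the same request. The function names build_cirrusdump_url and fetch_cirrusdump and the User-Agent string are made up for illustration; the only thing taken from above is the action=cirrusdump query parameter. It assumes the response body is JSON, as described, and makes no assumption about which fields the dump contains.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import Request, urlopen


def build_cirrusdump_url(page_url: str) -> str:
    """Return page_url with action=cirrusdump set in its query string.

    Hypothetical helper: it only rewrites the URL, no request is made.
    """
    parts = urlsplit(page_url)
    # Keep existing query parameters, but drop any earlier "action" value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "action"
    ]
    query.append(("action", "cirrusdump"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_cirrusdump(page_url: str, timeout: float = 10.0):
    """Fetch and decode the JSON dump for one page (performs a network request).

    Hypothetical helper; the User-Agent value is a placeholder.
    """
    request = Request(
        build_cirrusdump_url(page_url),
        headers={"User-Agent": "cirrusdump-example/0.1"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_cirrusdump("https://wikitech.wikimedia.org/wiki/Search") and passing the result to json.dumps(..., indent=2) should give roughly the same pretty-printed view as the browser extension.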
Thanks, Strainu
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l