Hi,
I've been reading the mediawiki.org and wikitech pages on CirrusSearch (and the code) in the hope of understanding how page content is transformed before being sent to ES and how it is stored there. I have a few questions:
1. Is the documentation available anywhere? I don't see it on https://doc.wikimedia.org/
2. What part of the whole ecosystem transforms the wikitext into indexable text? Where can I find it? It should be somewhere downstream from CirrusSearch\Updater::updateFromTitle(), but I can't figure out where exactly.
If this transformation doesn't happen, where does the searchable text come from?
3. Where can I find the ES schema used for wikipages? Is it different for images/categories?
Thanks, Strainu
On Thu, Oct 29, 2015 at 8:47 AM, Strainu strainu10@gmail.com wrote:
> - Is the documentation available anywhere? I don't see it on https://doc.wikimedia.org/
Feature documentation is at https://www.mediawiki.org/wiki/Help:CirrusSearch; operational documentation is at https://wikitech.wikimedia.org/wiki/Search
> - What part of the whole ecosystem transforms the wikitext into indexable text? Where can I find it? It should be somewhere downstream from CirrusSearch\Updater::updateFromTitle(), but I can't figure out where exactly.
The documents are built using the classes in https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/tree/master/i...
> If this transformation doesn't happen, where does the searchable text come from?
> - Where can I find the ES schema used for wikipages? Is it different for images/categories?
The ES schema is the same everywhere; the easiest way to see what the data looks like is to just request a dump for a particular page with ?action=cirrusdump. This outputs JSON; I use a Chrome extension called JsonView to make it look nice: https://wikitech.wikimedia.org/wiki/Search?action=cirrusdump
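For anyone who wants to poke at a dump programmatically, here is a small sketch of reading one; the structure and field names below are illustrative guesses from a typical cirrusdump response, not an authoritative schema:

```python
import json

# Illustrative shape of a ?action=cirrusdump response: a JSON array with one
# entry per index the page lives in, and the indexed document under "_source".
# The exact field set varies by wiki; inspect a real dump for the full list.
sample_dump = json.loads("""
[
  {
    "_index": "enwiki_content",
    "_type": "page",
    "_id": "9228",
    "_source": {
      "title": "Earth",
      "text": "Earth is the third planet from the Sun.",
      "category": ["Planets of the Solar System"],
      "template": ["Template:Infobox planet"]
    }
  }
]
""")

doc = sample_dump[0]["_source"]
print(doc["title"])   # page title as indexed
print(sorted(doc))    # which fields this particular document carries
```

Comparing the field list across a regular article, a file page, and a category page is the quickest way to see how much (or how little) the documents differ between them.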
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Thanks for the response, Erik; it's been very informative. I have a few follow-up questions (inline).
On 29 October 2015 17:56:25 EET, Erik Bernhardson ebernhardson@wikimedia.org wrote:
>> - Is the documentation available anywhere? I don't see it on https://doc.wikimedia.org/
> Feature documentation is at https://www.mediawiki.org/wiki/Help:CirrusSearch; operational documentation is at https://wikitech.wikimedia.org/wiki/Search
I was referring to the code docs; they make it easier to follow the class hierarchy.
>> - What part of the whole ecosystem transforms the wikitext into indexable text? Where can I find it? It should be somewhere downstream from CirrusSearch\Updater::updateFromTitle(), but I can't figure out where exactly.
> The documents are built using the classes in https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/tree/master/i...
I see you use already-parsed text. I'm wondering whether using the output of mwparserfromhell would work: I have some wikitext that is not in a MediaWiki database that I would like to index. I'm guessing I'll have to write some code, but the idea would be the same.
>> If this transformation doesn't happen, where does the searchable text come from?
>> - Where can I find the ES schema used for wikipages? Is it different for images/categories?
> The ES schema is the same everywhere; the easiest way to see what the data looks like is to just request a dump for a particular page with ?action=cirrusdump. This outputs JSON; I use a Chrome extension called JsonView to make it look nice: https://wikitech.wikimedia.org/wiki/Search?action=cirrusdump
That is very cool indeed.
Thanks again, Strainu
On Thu, Oct 29, 2015 at 2:22 PM, Strainu strainu10@gmail.com wrote:
> Thanks for the response, Erik; it's been very informative. I have a few follow-up questions (inline).
> I was referring to the code docs; they make it easier to follow the class hierarchy.
There is very minimal documentation of the code outside of the code itself. The best you will find, and it only covers a small portion of the code, are the parts I wrote up in the Indexing (https://wikitech.wikimedia.org/wiki/Search#Indexing) and Job queue (https://wikitech.wikimedia.org/wiki/Search#Job_queue) sections of the operational documentation. Feel free to stop by the #wikimedia-discovery channel on Freenode and ask questions; some of the developers on the team might be able to point you in the right direction.
>> The documents are built using the classes in https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/tree/master/i...
> I see you use already-parsed text. I'm wondering whether using the output of mwparserfromhell would work: I have some wikitext that is not in a MediaWiki database that I would like to index. I'm guessing I'll have to write some code, but the idea would be the same.
Correct, we use the output of the PHP wikitext parser for the initial portion of the transformation. The easiest way to integrate with CirrusSearch would be to reuse the MediaWiki parser. I've never played with mwparserfromhell, but as with most software, with some effort you can tie almost anything together :)
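To sketch the idea of indexing wikitext that lives outside a MediaWiki database: the function below is a crude, stdlib-only stand-in for what mwparserfromhell.parse(wikitext).strip_code() does far more robustly, and the document fields at the end only loosely mirror what CirrusSearch actually builds:

```python
import re

def wikitext_to_plaintext(wikitext: str) -> str:
    """Crude wikitext stripping; a rough stand-in for
    mwparserfromhell.parse(wikitext).strip_code()."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # drop simple {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # keep [[link|labels]]
    text = re.sub(r"'{2,}", "", text)                              # drop ''italic''/'''bold''' markup
    return re.sub(r"\s+", " ", text).strip()

wikitext = "'''Earth''' is the [[planet|third planet]] from the [[Sun]].{{citation needed}}"
plain = wikitext_to_plaintext(wikitext)
print(plain)  # Earth is the third planet from the Sun.

# A hypothetical document loosely shaped like a cirrus doc; with the
# elasticsearch-py client you would then index it with something like
# es.index(index="mywiki_content", document=doc).
doc = {"title": "Earth", "text": plain}
```

Real wikitext needs the real parser (nested templates, tables, refs, parser functions), which is why reusing the MediaWiki parser, as suggested above, is the safer route.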
>> The ES schema is the same everywhere; the easiest way to see what the data looks like is to just request a dump for a particular page with ?action=cirrusdump. This outputs JSON; I use a Chrome extension called JsonView to make it look nice: https://wikitech.wikimedia.org/wiki/Search?action=cirrusdump
> That is very cool indeed.