finding related pages [was Re: Reading tech sessions at hackathon]


S Page

1 Jun 2015

8:57 p.m.

Summary:
* CirrusSearch has "morelike:*PageName*", who knew?
* I sense a developer article brewing, "Finding related content"

AIUI, the Wikipedia mobile apps' "Read more" section just performs a full-text search (API [1] ) for the current page title (Android source [2]).

Joaquin's nifty demo http://chimeces.com/webkipedia/ 's "Related pages" section calls the GettingStarted extension's gettingstartedgetpages API module [3] with gsgptaskname=morelike . This is implemented by GettingStarted/MoreLikePageSuggester.php... and it seems this just makes a search query for srsearch=morelike:Australia . Who knew Cirrus search had a "morelike:" keyword? It's not in the enwiki search help, but it is in the Cirrus search help [4].

I'm not sure if there's any reason to interpose gettingstartedgetpages instead of querying search directly for morelike:*pagetitle*, it might cache stuff in Redis. The mobile apps might get better "Read more" suggestions using one of these.
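For anyone who wants to try the direct route, here is a minimal sketch of building such a search request. It assumes the standard list=search module on the enwiki endpoint; the function name and the default limit are made up for illustration, and nothing here is taken from the apps' or GettingStarted's actual code.

```python
from urllib.parse import urlencode

# Assumed endpoint; any wiki running CirrusSearch should accept the same parameters.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def morelike_search_url(title, limit=5, endpoint=API_ENDPOINT):
    """Build a list=search request whose query uses the morelike: prefix."""
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": "morelike:" + title,  # same shape as srsearch=morelike:Australia
        "srlimit": str(limit),            # hypothetical default, tune as needed
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(morelike_search_url("Australia"))
```

Fetching the URL and reading the titles out of the JSON response is left out on purpose so the sketch runs without network access.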

There's also a srwhat=suggestion, I don't know if that helps getting related pages.

I'll be updating https://www.mediawiki.org/wiki/API:Search_and_discovery with this, and it seems article-worthy.

Cheers, hope this helps someone.

[1] https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bsearch

[2] https://github.com/wikimedia/apps-android-wikipedia/blob/de0b8b579f5030f6843...

[3] https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bgettingst...

[4] https://www.mediawiki.org/wiki/Help:CirrusSearch#Special_prefixes

On Fri, May 29, 2015 at 2:18 AM, Joaquin Oltra Hernandez < jhernandez@wikimedia.org> wrote:

...

Sorry forgot to link it: https://github.com/joakin/webkipedia

Matt Flaschen told me about the gettingstarted 'morelike' mode for other purposes, but it fit my purposes for this reading app perfectly. The Growth team developed it about a year ago, but the experiment wasn't successful, so the API has been sitting there dormant and unused for a long time (it works pretty well!).

About the content I'm fetching for the articles, I'm using the extracts with the exintro option, and embedding the html ( https://github.com/joakin/webkipedia/blob/master/lib/api/article.js). The idea would be to have a 'Read more' that would show the full article I guess.
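A rough sketch of the kind of request this describes, assuming the TextExtracts parameters prop=extracts and exintro on the enwiki endpoint. It is written in Python for illustration only and is not the code in lib/api/article.js; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def intro_extract_url(title, endpoint=API_ENDPOINT):
    """Build a request for the HTML extract of a page's intro section only."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exintro": "1",   # boolean flag: only content before the first section heading
        "titles": title,
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(intro_extract_url("Albert Einstein"))
```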

The president's chest is pretty good too :D http://chimeces.com/webkipedia/#/wiki/Barack_Obama

On Fri, May 29, 2015 at 4:35 AM, S Page spage@wikimedia.org wrote:

...
(Cc'ing James Douglas, who's also developing API playground code.)

On Thu, May 28, 2015 at 2:52 AM, Joaquin Oltra Hernandez < jhernandez@wikimedia.org> wrote:

...
S, for getting started quickly, I set up a JS web app completely standalone with some basic infrastructure (libraries for calling the api, rendering pipeline of JS views, url routing) so that interested people could just get quickly to render a view within the app and do interesting stuff querying the API. We were also open to just doing a plain html file with some JS and CSS, or a codepen/jsbin style would have worked too.

Here's the demo of the lite wikipedia webapp I worked on: http://chimeces.com/webkipedia/

That's lovely! It's what API developers develop when they develop.

Where's the source?

I had no idea the gettingstartedgetpages would give you related pages, so obscure!

I guess RESTBase has no mode that strips the citations and such, or gives you just the opening section (prop=extracts & exintro=)

Nice closeup of the great man's chest :-),

http://chimeces.com/webkipedia/#/wiki/Albert_Einstein

-- =S Page WMF Tech writer



Nikolas Everett

1 Jun

9:24 p.m.

On Mon, Jun 1, 2015 at 4:57 PM, S Page spage@wikimedia.org wrote:

...

Summary:

CirrusSearch has "morelike:*PageName*", who knew?

People that read https://www.mediawiki.org/wiki/Help:CirrusSearch#Special_prefixes I guess.

* I sense a developer article brewing, "Finding related content"

...

AIUI, the Wikipedia mobile apps' "Read more" section just performs a full-text search (API [1] ) for the current page title (Android source [2]).

Joaquin's nifty demo http://chimeces.com/webkipedia/ 's "Related pages" section calls the GettingStarted extension's gettingstartedgetpages API module [3] with gsgptaskname=morelike . This is implemented by GettingStarted/MoreLikePageSuggester.php... and it seems this just makes a search query for srsearch=morelike:Australia . Who knew Cirrus search had a "morelike:" keyword? It's not in the enwiki search help, but it is in the Cirrus search help [4].

Ah. Now you mentioned it :)

...

I'm not sure if there's any reason to interpose gettingstartedgetpages instead of querying search directly for morelike:*pagetitle*, it might cache stuff in Redis. The mobile apps might get better "Read more" suggestions using one of these.

That query is slightly heavier than your average search query, but it's not a devastatingly expensive query, so it's probably not worth caching anything. Maybe if we really publicize it we should be more careful with the poolcounter, but it's probably OK.

...

There's also a srwhat=suggestion, I don't know if that helps getting related pages.

It's "did you mean:" I believe.

...

I'll be updating https://www.mediawiki.org/wiki/API:Search_and_discovery with this, and it seems article-worthy.

Cheers, hope this helps someone.

You can do it for multiple pages if you like by putting a pipe between the names, like morelike:Time Warner Cable|CNN <https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=morelike%3ATime+Warner+Cable|CNN&fulltext=Search>. We added that after talking with the gettingstarted folks.
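The pipe-separated form can be assembled with a small helper. This is only a sketch: the helper name is hypothetical, and it assumes the caller passes plain page titles that do not themselves contain a pipe.

```python
def morelike_query(titles):
    """Join page titles into a single morelike: search string.

    Titles are separated with "|", as in morelike:Time Warner Cable|CNN.
    """
    cleaned = [t.strip() for t in titles if t and t.strip()]
    if not cleaned:
        raise ValueError("at least one page title is required")
    for title in cleaned:
        if "|" in title:
            # A pipe inside a title would be read as a separator.
            raise ValueError("title must not contain '|': " + title)
    return "morelike:" + "|".join(cleaned)


if __name__ == "__main__":
    print(morelike_query(["Time Warner Cable", "CNN"]))
```

The resulting string would go into srsearch (API) or the search box (Special:Search) unchanged; URL encoding is left to whatever HTTP client is used.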

We've never tuned this feature. We could certainly do more with it if people were excited by it.

Nik

Bernd Sitzmann

10:07 p.m.

New subject: [reading-wmf] finding related pages [was Re: Reading tech sessions at hackathon]

For the few terms I've tried, the morelike: search prefix produced better "Read more" articles than our old way (a full-text search for the page title, with the article of the same title removed). So, add me to the excited column. :)

On Mon, Jun 1, 2015 at 11:24 PM, Nikolas Everett neverett@wikimedia.org wrote:

...

That query is slightly heavier than your average search query, but it's not a devastatingly expensive query, so it's probably not worth caching anything. Maybe if we really publicize it we should be more careful with the poolcounter, but it's probably OK.

...
There's also a srwhat=suggestion, I don't know if that helps getting related pages.

It's "did you mean:" I believe.

The Android app uses srinfo=suggestion for "Did you mean?". AFAICS from the API sandbox there is no srwhat=suggestion. That was probably a typo.
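A sketch of what a "Did you mean?" lookup with srinfo=suggestion might look like. The response shape used here (query.searchinfo.suggestion) is my understanding of the search module's JSON output and should be checked in the API sandbox; the function names are hypothetical and this is not the Android app's code.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint for illustration.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def did_you_mean_url(text, endpoint=API_ENDPOINT):
    """Build a search request that asks only for the spelling suggestion metadata."""
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": text,
        "srinfo": "suggestion",
        "srlimit": "1",  # the results themselves are not needed here
    }
    return endpoint + "?" + urlencode(params)


def extract_suggestion(response: dict) -> Optional[str]:
    """Return the suggested query, or None when the response carries none."""
    return response.get("query", {}).get("searchinfo", {}).get("suggestion")
```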

Bernd

Dan Garry

2 Jun

1:54 p.m.

On 1 June 2015 at 23:07, Bernd Sitzmann bernd@wikimedia.org wrote:

...

For the few terms I've tried, the morelike: search prefix produced better "Read more" articles than our old way

Funny, I kind of found the opposite! So, I suggest running a test.

You could increment the MobileWikiAppArticleSuggestions https://meta.wikimedia.org/wiki/Schema:MobileWikiAppArticleSuggestions schema, removing the "version" field (since it's redundant now anyway) and adding a "suggestionsSource" field. Make a copy of SuggestionsTask https://github.com/wikimedia/apps-android-wikipedia/blob/master/wikipedia/src/main/java/org/wikipedia/page/SuggestionsTask.java which uses the new method to generate results. Bucket users 50/50, half of them getting the old method for suggestions and half of them getting the new method. Transmit which version they got in the "suggestionsSource" field. Run analysis to determine which gets users to engage more, then go with that way! This would make a nice quarterly goal for next quarter, I think. :-)
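The 50/50 bucketing could be sketched roughly as below. Only the "suggestionsSource" field name comes from the proposal above; the bucket labels, the other event fields, and the idea of hashing an install ID are assumptions made for illustration (the real app is Java, and this Python sketch is not its code).

```python
import hashlib

# Hypothetical labels for the two ways of generating suggestions.
SOURCES = ("fulltext", "morelike")


def suggestions_source(app_install_id: str) -> str:
    """Assign an install to one of two buckets, stably across sessions.

    Hashing the ID (rather than drawing a random number each time) keeps a
    given user in the same bucket for the whole experiment.
    """
    digest = hashlib.sha256(app_install_id.encode("utf-8")).digest()
    return SOURCES[digest[0] % 2]


def build_event(app_install_id: str, page_title: str) -> dict:
    """Build a logging event; all field names except suggestionsSource are made up."""
    return {
        "appInstallID": app_install_id,
        "pageTitle": page_title,
        "suggestionsSource": suggestions_source(app_install_id),
    }


if __name__ == "__main__":
    print(build_event("example-install-id", "Barack Obama"))
```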

Thanks, Dan

-- Dan Garry Product Manager, Search and Discovery Wikimedia Foundation

Nikolas Everett

2:06 p.m.

These are the options we use for the more_like_this query:

$wgCirrusSearchMoreLikeThisConfig = array(
    'min_doc_freq' => 2, // Minimum number of documents (per shard) that need a term for it to be considered
    'max_query_terms' => 25,
    'min_term_freq' => 2,
    'percent_terms_to_match' => 0.3,
    'min_word_len' => 0,
    'max_word_len' => 0,
);

Here is the reference for what they mean and any more we might be able to set: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-ml...

We only use the "text" field of the articles - no weighting based on, well, anything. See the text field in https://en.wikipedia.org/wiki/Barack_Obama?action=cirrusdump for example.

Stuff we could do really, really easily:
1. Add URL parameters that override each of those options for easy experimenting.
2. Add URL parameters to use different fields, like our weighted all field, the wikitext, the intro paragraphs (don't ask how we extract intro paragraphs; it's a horrible hack), the section headers, or the "secondary" text like the infoboxes and image subtitles.

These are seriously very little work. A couple of hours. A day if we're being really good about testing _and_ someone merges something to core that screws up the tests. If it enables lots of cool experimenting I'm all for doing it.
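To make the first idea concrete, here is a language-neutral sketch (in Python, not the PHP that CirrusSearch would actually use) of overriding those defaults from URL parameters. The "mlt_" parameter prefix is purely hypothetical; only the option names and default values come from the config above.

```python
from urllib.parse import parse_qs

# Defaults copied from $wgCirrusSearchMoreLikeThisConfig as quoted above.
DEFAULTS = {
    "min_doc_freq": 2,
    "max_query_terms": 25,
    "min_term_freq": 2,
    "percent_terms_to_match": 0.3,
    "min_word_len": 0,
    "max_word_len": 0,
}

# Hypothetical prefix for override parameters; not an existing CirrusSearch feature.
PREFIX = "mlt_"


def effective_config(query_string: str) -> dict:
    """Return the defaults with any valid mlt_* URL parameters applied on top."""
    config = dict(DEFAULTS)
    for name, values in parse_qs(query_string).items():
        if not name.startswith(PREFIX):
            continue
        option = name[len(PREFIX):]
        if option not in DEFAULTS:
            continue  # ignore unknown options rather than failing the search
        caster = type(DEFAULTS[option])  # int or float, matching the default
        try:
            config[option] = caster(values[-1])
        except ValueError:
            pass  # keep the default when the value does not parse
    return config


if __name__ == "__main__":
    print(effective_config("search=morelike:Australia&mlt_max_query_terms=50"))
```

Ignoring bad or unknown overrides keeps an experiment URL from breaking search outright; a real implementation would probably also clamp the values to sane ranges.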

Nik

Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search

Dan Garry

3:03 p.m.

Nik, can you reflect this work in a task?

If it's something we can knock off quickly which will enable Readership to experiment on a self-serve basis, we should do it.

Thanks, Dan

-- Dan Garry Product Manager, Search and Discovery Wikimedia Foundation

Corey Floyd

3:29 p.m.

We created a ticket for this on iOS as soon as Joaquin gave his presentation at the hackathon.

Since we have the search schema set up, we can swap out the service and check the stats for the new version. (It would be nice if we had an additional field to specify the "search service" in the schema, but we could just query the specific version for which the search service was changed to get the data we need.)

While anecdotal comparisons of which service produces the "better" results are useful for evaluating the algorithm, the real test is in the analytics. If Read more click-through goes up, it's better; if it goes down, it's worse.
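The comparison itself is simple arithmetic. A minimal sketch, assuming each logged event records which service produced the suggestions and whether the user clicked one; the event shape and the labels are made up for illustration, and a real analysis would also want a significance test.

```python
def click_through_rates(events):
    """Compute click-through rate per suggestion source.

    `events` is an iterable of (source, clicked) pairs, one per time the
    Read more section was shown.
    """
    totals = {}
    for source, clicked in events:
        shown, clicks = totals.get(source, (0, 0))
        totals[source] = (shown + 1, clicks + int(bool(clicked)))
    return {source: clicks / shown for source, (shown, clicks) in totals.items()}


if __name__ == "__main__":
    # Made-up events, only to show the call shape.
    sample = [
        ("fulltext", True),
        ("fulltext", False),
        ("morelike", True),
        ("morelike", True),
    ]
    print(click_through_rates(sample))
```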

As far as tweaking the algorithm, I’d like to keep that all server side and let the clients be dumb. Maybe the API can return a value that would represent the algorithm so we could save that to the analytics for comparison?

reading-wmf mailing list reading-wmf@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/reading-wmf

-- Corey Floyd Software Engineer Mobile Apps / iOS Wikimedia Foundation

Nikolas Everett

3:36 p.m.

On Tue, Jun 2, 2015 at 11:03 AM, Dan Garry dgarry@wikimedia.org wrote:

...

Nik, can you reflect this work in a task?

https://phabricator.wikimedia.org/T101111

Dmitry Brant

2:45 p.m.

Dan, I'm curious what term(s) produced better results using full-text search rather than morelike? Maybe we can use those terms when tuning the parameters that Nik mentioned. Here's an etherpad for taking note of which method is better for which terms: https://etherpad.wikimedia.org/p/morelike_vs_fulltext

Gergo Tisza

1 Jun

10:16 p.m.

On Mon, Jun 1, 2015 at 1:57 PM, S Page spage@wikimedia.org wrote:

...

I'm not sure if there's any reason to interpose gettingstartedgetpages instead of querying search directly for morelike:*pagetitle*, it might cache stuff in Redis. The mobile apps might get better "Read more" suggestions using one of these.

At a glance, Redis is only used for caching category contents (i.e., not at all when the morelike option is used). Anyway, if the concern is frontend performance, then requests should be cached in Varnish (otherwise the request has to travel to the data center, spin up a PHP process, load MediaWiki, etc., so it's going to be slow even if all the API module does is fetch a value from cache), and that can be done for any API module via (s)maxage [1].
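A sketch of a cache-friendly request using the main module's maxage and smaxage parameters. The one-hour lifetime is an arbitrary example value and the function name is hypothetical; whether a given response is actually cached also depends on the server setup.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def cacheable_morelike_url(title, max_age_seconds=3600, endpoint=API_ENDPOINT):
    """Build a morelike search request that asks for cacheable responses."""
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": "morelike:" + title,
        "maxage": str(max_age_seconds),   # client-side cache lifetime
        "smaxage": str(max_age_seconds),  # shared (proxy) cache lifetime
    }
    # Sorting keeps the URL byte-identical across clients, which helps
    # a URL-keyed cache see them as the same request.
    return endpoint + "?" + urlencode(sorted(params.items()))


if __name__ == "__main__":
    print(cacheable_morelike_url("Australia"))
```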

OTOH there is a good reason not to use gettingstartedgetpages, namely that GettingStarted is not installed on most WMF wikis.

[1] http://www.mediawiki.org/wiki/API:Mainmodule

Joaquin Oltra Hernandez

2 Jun

9:29 a.m.

Thanks S for the info.

Nik, I think the API is going to be very useful, and hopefully we can integrate it to improve the engagement and reading experience. Hopefully this will prove more useful than suggesting articles to edit. It is already pretty good; if we can tune it even further to get better "reading" suggestions, it's going to be even more awesome.

I didn't know about the multiple titles, that's going to be useful for the Gather experiment we'll be doing for suggesting similar articles to add to a collection. Thanks.
