There is a popularity factor at work: all CirrusSearch queries take into
account the number of incoming links as part of a rescore on a few thousand
of the top results.
There are a few ways we can tweak this. All of the examples below use
internal testing query parameters. I can't recommend using these as part of
normal production usage outside of A/B testing, but they work well for
exploring variations.
Query patterns used:
  'opening text no boost links':
      '?search=morelike:%s&cirrusBoostLinks=no&cirrusMltUseFields=yes&cirrusMltFields=opening_text',
  'opening text':
      '?search=morelike:%s&cirrusMltUseFields=yes&cirrusMltFields=opening_text',
  'no boost links': '?search=morelike:%s&cirrusBoostLinks=no',
  'basic': '?search=morelike:%s',
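For anyone who wants to reproduce the comparison, here is a minimal sketch of how the patterns above could be expanded into full URLs. The base URL and the helper name `build_urls` are assumptions for illustration (the patterns only give the query-string part), and the sketch only builds the URLs rather than fetching them.

```python
from urllib.parse import quote

# Assumption: the patterns are appended to a wiki's index.php endpoint.
# The original message does not say which host or script was used.
BASE_URL = "https://en.wikipedia.org/w/index.php"

# Query patterns copied from the message; %s is replaced by the page title.
QUERY_PATTERNS = {
    "opening text no boost links":
        "?search=morelike:%s&cirrusBoostLinks=no"
        "&cirrusMltUseFields=yes&cirrusMltFields=opening_text",
    "opening text":
        "?search=morelike:%s&cirrusMltUseFields=yes&cirrusMltFields=opening_text",
    "no boost links": "?search=morelike:%s&cirrusBoostLinks=no",
    "basic": "?search=morelike:%s",
}


def build_urls(title, base_url=BASE_URL):
    """Return a {variant name: full URL} mapping for one page title."""
    # Percent-encode the title so spaces and parentheses are URL-safe.
    encoded = quote(title, safe="")
    return {
        name: base_url + pattern % encoded
        for name, pattern in QUERY_PATTERNS.items()
    }


if __name__ == "__main__":
    for name, url in build_urls("A_Summer_Bird-Cage").items():
        print(f"{name}: {url}")
```

Opening each URL for the same title side by side gives the kind of per-variant comparison shown in the test output.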
Test output:
A_Summer_Bird-Cage:
  basic
    I Know Why the Caged Bird Sings
    Princess Louise, Duchess of Argyll
    J. K. Rowling
  opening text
    I Know Why the Caged Bird Sings
    Themes in Maya Angelou's autobiographies
    Abnormal behaviour of birds in captivity
  opening text no boost links
    Themes in Maya Angelou's autobiographies
    Get Sexy
    I Know Why the Caged Bird Sings
  no boost links
    I Know Why the Caged Bird Sings
    Jerusalem the Golden
    Princess Louise, Duchess of Argyll
Isabel_Fonseca:
  basic
    Emma Goldman
    Martin Amis
    J. K. Rowling
  opening text
    I Know Why the Caged Bird Sings
    Kate Millett
    Hillary Clinton
  opening text no boost links
    I Know Why the Caged Bird Sings
    Mary Beth Keane
    Elizabeth F. Ellet
  no boost links
    Martin Amis
    Margaret Fuller
    Emma Goldman
Andrew_Michael_Hurley:
  basic
    J. K. Rowling
    Enid Blyton
    Ernest Shackleton
  opening text
    List of James Bond novels and short stories
    Harry Potter
    James Bond
  opening text no boost links
    List of James Bond novels and short stories
    Childhood's End
    Deborah Swift
  no boost links
    Pure (Miller novel)
    The Other Hand
    Stella Gibbons
The_Queen_of_the_Tearling:
  basic
    Emma Watson
    J. K. Rowling
    Emma Goldman
  opening text
    The Sun Also Rises
    The Twilight Saga
    The Historian
  opening text no boost links
    List of Buffyverse novels
    Witz (novel)
It's very hard to pick a few small samples of queries and say
"this is now better". I highly suggest, at a minimum, A/B testing
variations and basing results on user click-through and bounce rates.
Backtesting thousands of user queries and comparing them to user
click-through or satisfaction (click-through + dwell time) might be much
more useful.
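To make the suggested metrics concrete, here is a rough sketch of how click-through and satisfaction rates could be computed per variant. The session tuple format, the `summarize` helper, and the 10-second dwell cutoff are all assumptions for illustration; none of them come from CirrusSearch or any existing logging schema.

```python
from collections import defaultdict

# Assumption: a click counts as "satisfied" if the reader stays on the
# clicked page at least this long. The cutoff is illustrative only.
DWELL_THRESHOLD_SECONDS = 10.0


def summarize(sessions, dwell_threshold=DWELL_THRESHOLD_SECONDS):
    """Compute per-variant rates from (variant, clicked, dwell_seconds) tuples."""
    counts = defaultdict(lambda: {"sessions": 0, "clicks": 0, "satisfied": 0})
    for variant, clicked, dwell_seconds in sessions:
        bucket = counts[variant]
        bucket["sessions"] += 1
        if clicked:
            bucket["clicks"] += 1
            # Satisfaction = click-through plus enough dwell time.
            if dwell_seconds >= dwell_threshold:
                bucket["satisfied"] += 1
    return {
        variant: {
            "clickthrough_rate": bucket["clicks"] / bucket["sessions"],
            "satisfaction_rate": bucket["satisfied"] / bucket["sessions"],
        }
        for variant, bucket in counts.items()
    }


if __name__ == "__main__":
    # Made-up sessions, only to show the input shape.
    sample = [
        ("basic", True, 30.0),
        ("basic", False, 0.0),
        ("no boost links", True, 2.0),
        ("no boost links", True, 45.0),
    ]
    for variant, rates in summarize(sample).items():
        print(variant, rates)
```

A real comparison would need far more sessions per variant than this toy input, plus a significance check before declaring one variant better.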
On Thu, Feb 18, 2016 at 4:29 PM, Jon Katz <jkatz(a)wikimedia.org> wrote:
Thanks both! This clarifies a lot. I think the primary issue that editors
had raised and I had hoped to explore was popularity/importance v.
obscurity.
Specifically, there have been concerns that the results tilt towards more
popular articles (here
<https://www.mediawiki.org/wiki/Topic:Swjyfj59pkjfol7m> and here
<https://www.mediawiki.org/wiki/Topic:Sxy84nxinxqqld2i>), but it seems
that page traffic is not a variable. Instead, what seems to be happening
is that the raw # of similar terms is being used, rather than the % of
similar terms. This means that longer articles are favored. Is that a
fair assessment?
-J
On Thu, Feb 18, 2016 at 4:15 PM, Gergo Tisza <gtisza(a)wikimedia.org> wrote:
On Thu, Feb 18, 2016 at 4:00 PM, Jon Katz <jkatz(a)wikimedia.org> wrote:
Can someone on this list point me to where the more-like code sits? Or,
better yet, would be someone documenting the rules that govern
prioritization of suggestions.
I would like to document the logic for our communities so that we can
have an open discussion about what variables and weighting we should use to
suggest articles.
"More like" is an Elasticsearch
<https://en.wikipedia.org/wiki/Elasticsearch> feature; the
documentation is here
<https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html>.
I'd imagine the source code is way too complicated to give much insight to
the casual reader (as Elasticsearch is a large and complex piece of
software) but I never looked into the ES codebase so that's just a guess.
The configuration we use for morelike queries is here
<https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/867248ccf522541922507f23a9ddd0783bed3699/CirrusSearch.php#L450>.
The wrapper code that fires the ES query is here
<https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/867248ccf522541922507f23a9ddd0783bed3699/includes/Searcher.php#L800>
(but at a glance it doesn't do anything interesting).
_______________________________________________
Mobile-l mailing list
Mobile-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mobile-l