There is a popularity factor at work. All CirrusSearch queries take the number of incoming links into account as part of a rescore applied to a few thousand of the top results.
There are a few ways we can tweak this. All of the examples below use internal testing query parameters. I can't recommend using these in normal production usage outside of A/B testing, but they work well for exploring variations.
Query patterns used:

  'opening text no boost links': '?search=morelike:%s&cirrusBoostLinks=no&cirrusMltUseFields=yes&cirrusMltFields=opening_text'
  'opening text': '?search=morelike:%s&cirrusMltUseFields=yes&cirrusMltFields=opening_text'
  'no boost links': '?search=morelike:%s&cirrusBoostLinks=no'
  'basic': '?search=morelike:%s'
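To make the patterns above easier to try, here is a minimal sketch that expands them into full URLs for a given page title. The base URL and the function name `build_urls` are my assumptions for illustration; only the query-string patterns come from the tests above.

```python
from urllib.parse import quote

# Assumed base URL (not part of the original patterns); point it at the wiki under test.
BASE_URL = "https://en.wikipedia.org/w/index.php"

# Query-string patterns exactly as listed above; %s is replaced by the page title.
QUERY_PATTERNS = {
    "opening text no boost links": "?search=morelike:%s&cirrusBoostLinks=no&cirrusMltUseFields=yes&cirrusMltFields=opening_text",
    "opening text": "?search=morelike:%s&cirrusMltUseFields=yes&cirrusMltFields=opening_text",
    "no boost links": "?search=morelike:%s&cirrusBoostLinks=no",
    "basic": "?search=morelike:%s",
}


def build_urls(title, base_url=BASE_URL):
    """Return a {variant name: URL} mapping for one page title (hypothetical helper)."""
    encoded = quote(title, safe="")  # percent-encode anything unsafe in a query string
    return {name: base_url + pattern % encoded for name, pattern in QUERY_PATTERNS.items()}


if __name__ == "__main__":
    for name, url in build_urls("A_Summer_Bird-Cage").items():
        print(f"{name}: {url}")
```

This only builds the URLs; fetching them and reading the result titles is left out, since how you do that depends on the wiki and on whether the testing parameters are enabled there.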
Test output (results separated by semicolons):

A_Summer_Bird-Cage:
  basic: I Know Why the Caged Bird Sings; Princess Louise, Duchess of Argyll; J. K. Rowling
  opening text: I Know Why the Caged Bird Sings; Themes in Maya Angelou's autobiographies; Abnormal behaviour of birds in captivity
  opening text no boost links: Themes in Maya Angelou's autobiographies; Get Sexy; I Know Why the Caged Bird Sings
  no boost links: I Know Why the Caged Bird Sings; Jerusalem the Golden; Princess Louise, Duchess of Argyll

Isabel_Fonseca:
  basic: Emma Goldman; Martin Amis; J. K. Rowling
  opening text: I Know Why the Caged Bird Sings; Kate Millett; Hillary Clinton
  opening text no boost links: I Know Why the Caged Bird Sings; Mary Beth Keane; Elizabeth F. Ellet
  no boost links: Martin Amis; Margaret Fuller; Emma Goldman

Andrew_Michael_Hurley:
  basic: J. K. Rowling; Enid Blyton; Ernest Shackleton
  opening text: List of James Bond novels and short stories; Harry Potter; James Bond
  opening text no boost links: List of James Bond novels and short stories; Childhood's End; Deborah Swift
  no boost links: Pure (Miller novel); The Other Hand; Stella Gibbons

The_Queen_of_the_Tearling:
  basic: Emma Watson; J. K. Rowling; Emma Goldman
  opening text: The Sun Also Rises; The Twilight Saga; The Historian
  opening text no boost links: List of Buffyverse novels; Witz (novel)
It's very hard to pick a few small samples of queries and say "this is now better". I strongly suggest, at a minimum, A/B testing the variations and judging them on user click-through and bounce rates. Backtesting thousands of user queries and comparing the results to user click-through or satisfaction (click-through + dwell) might be much more useful.
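As a rough illustration of the comparison I mean, here is a minimal sketch that computes a click-through rate per variant from logged impressions. The event format, the function name, and the sample data are all made up for the example; real logs would also need bounce and dwell information.

```python
from collections import defaultdict


def click_through_rates(events):
    """Compute click-through rate per variant.

    `events` is an iterable of (variant, clicked) pairs, one per impression.
    This format is an assumption for illustration, not an existing log schema.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for variant, was_clicked in events:
        shown[variant] += 1
        if was_clicked:
            clicked[variant] += 1
    # Every variant in `shown` has at least one impression, so no division by zero.
    return {variant: clicked[variant] / shown[variant] for variant in shown}


if __name__ == "__main__":
    # Invented sample data, only to show the shape of the input.
    sample = [
        ("basic", True),
        ("basic", False),
        ("opening text", True),
        ("opening text", True),
    ]
    for variant, rate in click_through_rates(sample).items():
        print(f"{variant}: {rate:.2f}")
```

With only a handful of impressions per variant the rates are noise, which is the same problem as eyeballing a few sample queries; a real test needs enough traffic for the difference to be meaningful.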
On Thu, Feb 18, 2016 at 4:29 PM, Jon Katz jkatz@wikimedia.org wrote:
Thanks both! This clarifies a lot. I think the primary issue that editors had raised and I had hoped to explore was popularity/importance v. obscurity.
Specifically, there have been concerns that the results tilt towards more popular articles (here https://www.mediawiki.org/wiki/Topic:Swjyfj59pkjfol7m and here https://www.mediawiki.org/wiki/Topic:Sxy84nxinxqqld2i), but it seems that page traffic is not a variable. Instead, what seems to be happening is that the raw # of similar terms is being used, rather than the % of similar terms. This means that longer articles are favored. Is that a fair assessment?
-J
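To make the distinction in the question above concrete, here is a toy sketch contrasting a raw count of shared terms with the share of a candidate's terms that match. It is only an illustration of the two ideas; it is not how Elasticsearch's "more like this" query scores documents, and the function names and word lists are invented.

```python
def raw_overlap(source_terms, candidate_terms):
    """Number of distinct terms the two articles share."""
    return len(set(source_terms) & set(candidate_terms))


def proportional_overlap(source_terms, candidate_terms):
    """Fraction of the candidate's distinct terms that also occur in the source."""
    candidate = set(candidate_terms)
    if not candidate:
        return 0.0
    return len(set(source_terms) & candidate) / len(candidate)


if __name__ == "__main__":
    source = {"novel", "author", "british", "fiction"}
    # A short article that is almost entirely on topic.
    short_article = {"novel", "author", "british"}
    # A long article that mentions every source term among much unrelated text.
    long_article = source | {f"filler{i}" for i in range(16)}

    for name, terms in [("short", short_article), ("long", long_article)]:
        print(name, raw_overlap(source, terms), proportional_overlap(source, terms))
```

In this toy case the long article wins on the raw count while the short one wins on the proportion, which is the length effect the question describes. Whether the production ranking actually behaves this way is exactly what is being asked.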
On Thu, Feb 18, 2016 at 4:15 PM, Gergo Tisza gtisza@wikimedia.org wrote:
On Thu, Feb 18, 2016 at 4:00 PM, Jon Katz jkatz@wikimedia.org wrote:
Can someone on this list point me to where the more-like code sits? Or, better yet, could someone document the rules that govern prioritization of suggestions?
I would like to document the logic for our communities so that we can have an open discussion about what variables and weighting we should use to suggest articles.
"More like" is an Elasticsearch https://en.wikipedia.org/wiki/Elasticsearch feature; the documentation is here https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html. I'd imagine the source code is way too complicated to give much insight to the casual reader (as Elasticsearch is a large and complex piece of software) but I never looked into the ES codebase so that's just a guess. The configuration we use for morelike queries is here https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/867248ccf522541922507f23a9ddd0783bed3699/CirrusSearch.php#L450. The wrapper code that fires the ES query is here https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/867248ccf522541922507f23a9ddd0783bed3699/includes/Searcher.php#L800 (but at a glance it doesn't do anything interesting).
Mobile-l mailing list Mobile-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mobile-l