Thanks both! This clarifies a lot. The primary issue that editors had raised, and that I had hoped to explore, is popularity/importance vs. obscurity.

Specifically, there have been concerns that the results tilt towards more popular articles (here and here), but it seems that page traffic is not a variable. Instead, what seems to be happening is that the raw number of similar terms is being used, rather than the percentage of similar terms. If so, longer articles are favored simply because they contain more terms that can match. Is that a fair assessment?
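
To make the distinction I'm asking about concrete, here is a toy sketch in Python. It is not how Elasticsearch actually scores "more like" results; the function names, the whitespace tokenization, and the sample texts are all made up for illustration. It only shows how ranking by a raw count of shared terms can favor a longer article, while ranking by the share of the candidate's terms that match can favor a shorter one.

```python
# Toy illustration only: NOT the Elasticsearch scoring formula.
# All names and sample texts below are hypothetical.

def terms(text):
    """Split text into a set of lowercase, whitespace-separated terms."""
    return set(text.lower().split())

def raw_score(source, candidate):
    """Number of distinct terms the candidate shares with the source."""
    return len(terms(source) & terms(candidate))

def percent_score(source, candidate):
    """Fraction of the candidate's distinct terms that also appear in the source."""
    candidate_terms = terms(candidate)
    if not candidate_terms:
        return 0.0
    return len(terms(source) & candidate_terms) / len(candidate_terms)

SOURCE = "solar eclipse moon shadow sun"
SHORT_ARTICLE = "eclipse moon shadow"
LONG_ARTICLE = ("eclipse moon shadow sun orbit tides gravity rocket "
                "launch apollo crater telescope")

for name, article in [("short", SHORT_ARTICLE), ("long", LONG_ARTICLE)]:
    print(name, raw_score(SOURCE, article), round(percent_score(SOURCE, article), 2))
```

In this made-up example the long article wins on raw count (it shares four terms with the source against three), while the short article wins on percentage (every one of its terms matches). Whether the real configuration behaves like either of these is exactly what I'd like to confirm.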

-J

On Thu, Feb 18, 2016 at 4:15 PM, Gergo Tisza <gtisza@wikimedia.org> wrote:

On Thu, Feb 18, 2016 at 4:00 PM, Jon Katz <jkatz@wikimedia.org> wrote:
Can someone on this list point me to where the more-like code sits? Or, better yet, could someone document the rules that govern prioritization of suggestions?

I would like to document the logic for our communities so that we can have an open discussion about what variables and weighting we should use to suggest articles.

"More like" is an Elasticsearch feature; the documentation is here. I'd imagine the source code is way too complicated to give much insight to the casual reader (as Elasticsearch is a large and complex piece of software), but I have never looked into the ES codebase, so that's just a guess. The configuration we use for morelike queries is here. The wrapper code that fires the ES query is here (but at a glance it doesn't do anything interesting).