Re: [WikimediaMobile] Similar articles feature performance in CirrusSearch for apps and mobile web

21 Jan 2016

Regarding caching: as Adam noted, apps and web would need to agree on the
URL and the smaxage parameter so that the URLs are *exactly* the same. That
way we don't bloat Varnish and we reuse the same cached objects across
platforms.

It is an extremely ad hoc and brittle solution, but it seems like it would
be the greatest win.
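
To illustrate the point about identical URLs, here is a minimal Python sketch of how a client could canonicalize the query string before issuing the request. The function name `canonicalize` and the example URLs are hypothetical and not something agreed on in this thread. It only sorts the query parameters; it does not reconcile different hostnames (e.g. en.m.wikipedia.org vs en.wikipedia.org), which presumably would also have to match for the cached object to be shared.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize(url):
    """Return the URL with its query parameters sorted by name and value.

    Two requests that differ only in parameter order then map to the
    same string, and so to the same cache key.
    """
    parts = urlsplit(url)
    # keep_blank_values so flags like "continue=" are not silently dropped
    query = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), "")
    )


# Hypothetical URLs: same parameters, different order.
web = "https://en.wikipedia.org/w/api.php?format=json&action=query&smaxage=86400"
app = "https://en.wikipedia.org/w/api.php?smaxage=86400&action=query&format=json"
print(canonicalize(web) == canonicalize(app))
```

Sorting is only one possible convention; any fixed order works as long as every client uses the same one.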

20% of the search traffic coming from a feature that is only in Android
and web beta seems like a lot to me, and we should work on reducing it.
Otherwise, when it hits web stable we're going to crush the servers, so
caching seems the highest priority.

Let's chime in on https://phabricator.wikimedia.org/T124216 and continue
the cache discussion there.

Regarding the validity of results with opening text only, how should we
proceed? Adam?

On Wed, Jan 20, 2016 at 9:34 PM, David Causse <dcausse(a)wikimedia.org> wrote:

...
  Hi,

 Yes we can combine many factors, from templates (quality but also
 disambiguation/stubs), size and others.
 Today cirrus mostly uses the number of incoming links, which (imho) is not
 very good for morelike.
 On enwiki, results will also be scored according to the weights defined in
 https://en.wikipedia.org/wiki/MediaWiki:Cirrussearch-boost-templates.

 I wrote a small bash script to compare results:
 https://gist.github.com/nomoa/93c5097e3c3cb3b6ebad
 Here are some random results from the list (sometimes better, sometimes
 worse):

 $ sh morelike.sh Revolution_Muslim
 Defaults
         "title": "Chess",
         "title": "Suicide attack",
         "title": "Zachary Adam Chesser",
 =======
 Opening text no boost links
         "title": "Hungarian Revolution of 1956",
         "title": "Muslims for America",
         "title": "Salafist Front",

 $ sh morelike.sh Chesser
 Defaults
         "title": "Chess",
         "title": "Edinburgh",
         "title": "Edinburgh Corn Exchange",
 =======
 Opening text no boost links
         "title": "Dreghorn Barracks",
         "title": "Edinburgh Chess Club",
         "title": "Threipmuir Reservoir",

 $ sh morelike.sh Time_%28disambiguation%29
 Defaults
         "title": "Atlantis: The Lost Empire",
         "title": "Stargate",
         "title": "Stargate SG-1",
 =======
 Opening text no boost links
         "title": "Father Time (disambiguation)",
         "title": "The Last Time",
         "title": "Time After Time",

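For readers who want to reproduce a comparison like the one above without the script, here is a minimal Python sketch that builds the two request URLs. It is not the linked gist. The endpoint, the function names and the `list=search` / `srlimit` choices are assumptions; the two `cirrusMlt*` parameters are the ones quoted from Erik's message in this thread. The "no boost links" part of the variant is not covered, since the parameter for it is not given here.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint


def morelike_params(title, opening_text_only=False, limit=3):
    """Build search API parameters for a morelike: query on one title."""
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": "morelike:" + title,
        "srlimit": limit,
    }
    if opening_text_only:
        # Restrict more-like-this to the opening text (from Erik's message).
        params["cirrusMltUseFields"] = "yes"
        params["cirrusMltFields"] = "opening_text"
    return params


def morelike_url(title, opening_text_only=False):
    """Return the full request URL for the default or restricted setup."""
    return API + "?" + urlencode(morelike_params(title, opening_text_only))


print(morelike_url("Chesser"))
print(morelike_url("Chesser", opening_text_only=True))
```

Fetching both URLs and printing the "title" fields side by side would give output in the same shape as the listings above.
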
 On 20/01/2016 19:34, Jon Robson wrote:

  I'm actually interested to see whether this yields better results in
 certain examples where the algorithm is lacking [1]. If it's done as
 an A/B test we could even measure things such as click-throughs in the
 related article feature (whether they go up or not).

 Out of interest, is it also possible to take article size and type into
 account and not return any morelike results for things like
 disambiguation pages and stubs?

 [1] https://www.mediawiki.org/wiki/Topic:Swsjajvdll3pf8ya

 On Wed, Jan 20, 2016 at 9:47 AM, Adam Baso <abaso(a)wikimedia.org> wrote:

  One thing we could do regarding the quality of the output is check
 results against a random sample of popular articles (example approach to
 find some articles) on mdot Wikipedia. Presuming that improves the quality
 of the recommendations or at least does not degrade them, we should
 consider adding the enhancement task to a future sprint, with further
 instrumentation and A/B testing / timeboxed beta test, etc.

 Joaquin, smaxage (e.g., 24 hour cached responses) does seem a good fix for
 now for further reduction of client perceived wait, at least for non-cold
 cache requests, even if we stop beating up the backend. Does anyone know
 of a compelling reason to not do that for the time being? The main thing
 that comes to mind as always is growing the Varnish cache object pool -
 probably not a huge deal while the thing is only in beta, but on the
 stable channel maybe noteworthy because it would run on probably most
 pages (but that's what edge caches are for, after all).

 Erik, from your perspective does use of smaxage relieve the backend
 sufficiently?

 If we do smaxage, then Web, Android, iOS should standardize their URLs so
 we get more cache hits at the edge across all clients. Here's the URL I
 see being used on the web today from mobile web beta:

https://en.m.wikipedia.org/w/api.php?action=query&format=json&forma…

 -Adam

 On Wed, Jan 20, 2016 at 7:45 AM, Joaquin Oltra Hernandez
 <jhernandez(a)wikimedia.org> wrote:

  I'd be up for it if we manage to cram it into a following sprint and it
 is worth it.

 We could run a controlled test against production with a long batch of
 articles, check median/percentile response times over repeated runs, and
 highlight the differing results for human inspection regarding quality.
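
As a rough illustration of the median/percentile check, here is a minimal Python sketch that summarizes one batch of response times. The function name `latency_summary` and the choice of p95/p99 are assumptions for illustration; how the timings are collected is left out.

```python
import statistics


def latency_summary(samples_ms):
    """Return median, p95 and p99 for a list of response times in ms.

    Needs at least two samples. quantiles(n=100) yields the 99 cut
    points between percentiles, so index 94 is the 95th percentile.
    """
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "median": statistics.median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

Running it once over the timings for the default setup and once over those for the opening_text setup, for the same batch of articles, would give two summaries to compare.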

 It's been noted previously that the results are far from ideal (which they
 are because it is just morelike), and I think it would be a great idea to
 change the endpoint to a specific one that is smarter and has some cache
 (we could do much more to get relevant results besides text similarity:
 take into account links, or "See also" links if there are any, etc.).

 As a note, in mobile web the related articles extension allows editors to
 specify articles to show in the section, which would avoid queries to
 cirrussearch if it were used more (once rolled into stable I guess).

 I remember that the performance related task was closed as resolved
 (https://phabricator.wikimedia.org/T121254#1907192). Should we reopen it
 or create a new one?

 I'm not sure if we ended up adding the smaxage parameter (I think we
 didn't). Should we? To me it seems a no-brainer that we should be caching
 these results in Varnish, since they don't need to be completely up to
 date for this use case.

 On Tue, Jan 19, 2016 at 11:54 PM, Erik Bernhardson
 <ebernhardson(a)wikimedia.org> wrote:

> Both mobile apps and web are using CirrusSearch's morelike: feature,
> which is showing some performance issues on our end. We would like to
> make a performance optimization to it, but before doing so we would
> prefer to run an A/B test to see if the results are still "about as
> good" as they are currently.
>
> The optimization is basically this: currently more like this takes the
> entire article into account; we would like to change this to take only
> the opening text of an article into account. This should reduce the
> amount of work we have to do on the backend, saving both server load and
> the latency the user sees when running the query.
>
> This can be triggered by adding these two query parameters to the search
> API request that is being performed:
>
> cirrusMltUseFields=yes&cirrusMltFields=opening_text
>
>
> The API will give a warning that these parameters do not exist, but the
> warning is safe to ignore. Would any of you be willing to run this test?
> We would basically want to look at user perceived latency along with
> click-through rates for the current default setup along with the
> restricted setup using only opening_text.
>
> Erik B.
>
> _______________________________________________
> Mobile-l mailing list
> Mobile-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/mobile-l
>
