reply-all is hard...
---------- Forwarded message ----------
From: Erik Bernhardson <ebernhardson(a)wikimedia.org>
Date: Wed, Jan 20, 2016 at 12:14 PM
Subject: Re: [WikimediaMobile] Similar articles feature performance in
CirrusSearch for apps and mobile web
To: Joaquin Oltra Hernandez <jhernandez(a)wikimedia.org>
On Wed, Jan 20, 2016 at 7:45 AM, Joaquin Oltra Hernandez <
jhernandez(a)wikimedia.org> wrote:
I'd be up for it if we manage to cram it into a following sprint and it is
worth it.
We could run a controlled test against production with a long batch of
articles, check median/percentile response times over repeated runs, and
highlight the differing results for human inspection of quality.
I think I can work this up. David and I have done some basic checks of a
few dozen articles, and on average latency was around half when using only
opening text. I'll work up something a bit more complete across a few
thousand articles. One difficulty here is that morelike performance changes
depending on cluster load: during the busy part of our day, morelike takes
50% longer than at the low points [1].
It's been noted previously that the results are far from ideal (which they
are, because it is just *morelike*), and I think it would be a great idea
to change the endpoint to a specific one that is smarter and has some cache
(we could do much more to get relevant results besides text similarity:
take links into account, or *see also* links if there are any, etc.).
We've talked about a dedicated endpoint internally but haven't gotten
anywhere on it. I can fairly easily put together a Cirrus-specific API
endpoint; it's been held up on deciding whether we should instead be
putting the API into core and building up some sort of abstraction around
it. Putting it into core would probably make more sense if we are doing
more than the basic morelike query.
As a note, on mobile web the RelatedArticles extension allows editors to
specify the articles to show in the section, which would avoid queries to
CirrusSearch if it were used more (once rolled into stable, I guess).
I remember that the performance-related task was closed as resolved (
https://phabricator.wikimedia.org/T121254#1907192); should we reopen it
or create a new one?
I'll create a new one. Some performance concerns were addressed there, and
we did see a reduction in server work (average fetch latency cut in half).
Morelike still accounts for around 20% of server load, even though it is
only in the 700 qps range (vs 4k for fulltext and 8k for prefix; note these
counts are after fanning out to shards, not the number of requests sent to
MediaWiki).
I'm not sure if we ended up adding the smaxage parameter (I think we didn't
<https://github.com/wikimedia/mediawiki-extensions-RelatedArticles/search?utf8=%E2%9C%93&q=maxage&type=Code>);
should we? To me it seems a no-brainer that we should be caching these
results in Varnish, since they don't need to be completely up to date for
this use case.
I've been unsure about using smaxage on the search api, due to
fragmentation between how different clients use the api. After further
investigation I've perhaps been worried for no reason.
A relatively naive query in Hive suggests that in the span of 24h we could
cut morelike queries to the backend from 7.3M to 1.7M:

select sum(total), sum(deduplicated) from (
    select count(1) as total,
           count(distinct requests[0].query) as deduplicated
    from wmf_raw.cirrussearchrequestset
    where year=2016 and month=1 and day=10
      and requests[0].querytype = 'more_like'
    group by wikiid
) x;

_c0        _c1
7331659    1726091
This next query tries to get a rough estimate of how that compares to the
variance in how URIs are sent. I'm not sure how good an approximation this
is, but the totals are similar enough that it might be a good guess:

select sum(total), sum(deduplicated) from (
    select count(1) as total,
           count(distinct uri_query) as deduplicated
    from wmf.webrequest
    where year=2016 and month=1 and day=10
      and uri_query LIKE '%search=morelike%'
    group by uri_host
) x;

_c0        _c1
7383599    2214332
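To put a number on the potential cache hit rate, here is a quick sanity
check of the figures from the two queries above; it's just arithmetic on
the totals already quoted, nothing new:

```python
# Back-of-the-envelope: what fraction of morelike requests in 24h are
# repeats that a cache could absorb? Numbers come from the Hive query
# results quoted above (total vs. distinct requests).
backend_total, backend_distinct = 7_331_659, 1_726_091
uri_total, uri_distinct = 7_383_599, 2_214_332

backend_cacheable = 1 - backend_distinct / backend_total
uri_cacheable = 1 - uri_distinct / uri_total

print(f"backend: {backend_cacheable:.0%} of queries are repeats")  # 76%
print(f"uri:     {uri_cacheable:.0%} of queries are repeats")      # 70%
```

So either way of counting suggests roughly three quarters of the morelike
traffic is duplicated within a day.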
In summary: to resolve the current load issues we are seeing, I will figure
out how to get these cached. I've created
https://phabricator.wikimedia.org/T124216 for Discovery to figure that out.
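For illustration only (the actual approach is what that task is meant to
decide): a sketch of what a morelike search request looks like with the
MediaWiki API's smaxage/maxage cache-control parameters set, so Varnish
could serve repeats. The article title and the one-day TTL are placeholders,
not values anyone has agreed on:

```python
from urllib.parse import urlencode

# Hypothetical sketch of a cacheable morelike search API request.
# smaxage/maxage are the standard MediaWiki API cache-control parameters;
# the title and TTL values here are illustrative placeholders.
params = {
    "action": "query",
    "list": "search",
    "srsearch": "morelike:Albert Einstein",
    "format": "json",
    "smaxage": 86400,  # how long shared caches (Varnish) may keep the response
    "maxage": 86400,   # how long the client's browser may keep it
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```

The point of smaxage specifically is that it sets s-maxage in the
Cache-Control header, which is what the shared Varnish layer honors.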
Changing to opening_text would still provide a large latency benefit on
non-cached pages. It may also help with relevancy, but that is hard to
guesstimate. I still think measuring click-through rates could inform the
relevancy decision without too much programming overhead (depending on how
much work it is to add an A/B test; I've been told it's fairly painless in
the apps?)
On Tue, Jan 19, 2016 at 11:54 PM, Erik Bernhardson <
ebernhardson(a)wikimedia.org> wrote:
Both the mobile apps and mobile web are using CirrusSearch's morelike:
feature, which is showing some performance issues on our end. We would like
to make a performance optimization to it, but first we would prefer to run
an A/B test to see if the results are still "about as good" as they are
currently. The optimization is basically this: currently, more like this
takes the entire article into account; we would like to change it to take
only the opening text of an article into account. This should reduce the
amount of work we have to do on the backend, saving both server load and
the latency the user sees when running the query.
This can be triggered by adding these two query parameters to the search
api request that is being performed:
cirrusMltUseFields=yes&cirrusMltFields=opening_text
The API will give a warning that these parameters do not exist, but it is
safe to ignore. Would any of you be willing to run this test? We would
basically want to look at user-perceived latency along with click-through
rates for the current default setup and for the restricted setup using
only opening_text.
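Concretely, the experimental request would be the normal morelike search
with those two parameters appended. A sketch (the article title is a
placeholder; the two cirrusMlt* parameters are exactly the ones given
above):

```python
from urllib.parse import urlencode

# Sketch of the experimental morelike request restricted to opening_text.
# The title is a placeholder; cirrusMltUseFields/cirrusMltFields are the
# two query parameters described in this email.
params = {
    "action": "query",
    "list": "search",
    "srsearch": "morelike:Albert Einstein",
    "format": "json",
    "cirrusMltUseFields": "yes",
    "cirrusMltFields": "opening_text",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```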
Erik B.
_______________________________________________
Mobile-l mailing list
Mobile-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mobile-l
[1]
https://grafana.wikimedia.org/dashboard/db/elasticsearch?panelId=28&ful…