On Mon, Jun 10, 2019 at 4:20 PM Yetkin Sakal via Cloud
<cloud(a)lists.wikimedia.org> wrote:
Hello,
What can be done to decrease the lagging on web.db.svc.eqiad.wmflabs? Tools working on
this are severely affected ands up with outdated results.
We did a bit of live investigation with the help of the DBA team on a
related IRC report a few days ago [0]. It turned out that the
replication stalls were happening at that time because 100% of the CPU
of the server providing the *.web.db.svc.eqiad.wmflabs databases was
being used to handle active queries. The replication
threads/subprocesses were waiting their turn to apply changes to the
underlying tables.
One guess from me as to the underlying cause is the combination of the
actor and comment table schema changes. This seems to be supported by
the increasing frequency of lag starting 2019-06-02 [1]. There is no
magic server side fix if this guess is correct. Instead, tools will
need to change the queries that they are running to make more
efficient use of the new tables and/or do searches against a smaller
set of rows at a time in the wiki replicas.
One of the reasons that actor and comment searches are slow is related
to the way that we ensure that content which has been suppressed/rev
deleted is removed from the wiki replicas. This is done in real-time
as part of the views that are publicly exposed over the underlying
tables. In the case of both the actor and comment views, the view is
not aware of the type of entity that a given query is interested in.
It is designed so that you could do a query like `select * from
comment` and it would properly suppress any and all comments that
should not be publicly viewable. To accomplish this, the views for
actor and comment are burdened by subqueries against 8 other tables in
order to ensure that only the correct rows are available.
This multi-table subquery problem is being discussed in T215445 [2],
but since that's work toward a long-term solution, the Cloud Services
team has put up a set of sub-views of these tables to make things
faster for individual queries. Each subview only makes a subquery
against a single table. This way, if you are looking for an actor row
from the logging table, then you can query against the actor_logging
view. This view will be more efficient for that case. The
actor_logging view will, naturally, not have any rows available that
might be referenced instead in the revision table or the archive
table.
These new views have been documented on wikitech [3], but we haven't
done a messaging push to advertise them to the wiki replica users yet.
Hopefully a few folks will read about them here in this thread! The
Cloud Services team will also make a cloud-announce posting about this
soon.
[0]:
https://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-cloud/20190607.txt
(starting around 20:43)
[1]:
https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20…
[2]:
https://phabricator.wikimedia.org/T215445
[3]:
https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_…
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA
irc: bd808 v:415.839.6885 x6855