On Mon, Jun 10, 2019 at 4:20 PM Yetkin Sakal via Cloud cloud@lists.wikimedia.org wrote:
Hello,
What can be done to decrease the lagging on web.db.svc.eqiad.wmflabs? Tools working on this are severely affected ands up with outdated results.
We did a bit of live investigation with the help of the DBA team on a related IRC report a few days ago [0]. It turned out that the replication stalls were happening at that time because 100% of the CPU of the server providing the *.web.db.svc.eqiad.wmflabs databases was being used to handle active queries. The replication threads/subprocesses were waiting their turn to apply changes to the underlying tables.
One guess from me as to the underlying cause is the combination of the actor and comment table schema changes. This seems to be supported by the increasing frequency of lag starting 2019-06-02 [1]. There is no magic server side fix if this guess is correct. Instead, tools will need to change the queries that they are running to make more efficient use of the new tables and/or do searches against a smaller set of rows at a time in the wiki replicas.
One of the reasons that actor and comment searches are slow is related to the way that we ensure that content which has been suppressed/rev deleted is removed from the wiki replicas. This is done in real-time as part of the views that are publicly exposed over the underlying tables. In the case of both the actor and comment views, the view is not aware of the type of entity that a given query is interested in. It is designed so that you could do a query like `select * from comment` and it would properly suppress any and all comments that should not be publicly viewable. To accomplish this, the views for actor and comment are burdened by subqueries against 8 other tables in order to ensure that only the correct rows are available.
This multi-table subquery problem is being discussed in T215445 [2], but since that's work toward a long-term solution, the Cloud Services team has put up a set of sub-views of these tables to make things faster for individual queries. Each subview only makes a subquery against a single table. This way, if you are looking for an actor row from the logging table, then you can query against the actor_logging view. This view will be more efficient for that case. The actor_logging view will, naturally, not have any rows available that might be referenced instead in the revision table or the archive table.
These new views have been documented on wikitech [3], but we haven't done a messaging push to advertise them to the wiki replica users yet. Hopefully a few folks will read about them here in this thread! The Cloud Services team will also make a cloud-announce posting about this soon.
[0]: https://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-cloud/20190607.txt (starting around 20:43) [1]: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20p... [2]: https://phabricator.wikimedia.org/T215445 [3]: https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_R...
Bryan