Hello, What can be done to decrease the lag on web.db.svc.eqiad.wmflabs? Tools working on this host are severely affected and end up with outdated results. Regards.
Can you provide more information? Most of the time the lag is minimal. enwiki is at 7 seconds, dewiki is at 0.
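For reference, figures like these can be pulled straight from the replicas; a minimal sketch, assuming the heartbeat_p view documented on wikitech (column names may differ slightly):

  SELECT shard, last_updated, lag
  FROM heartbeat_p.heartbeat
  ORDER BY lag DESC;   -- lag is the per-shard replication delay in seconds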
Hello,
We currently have some issues going on. One of the hosts had issues with s4 (commonswiki): https://phabricator.wikimedia.org/T225390
And the s3 replicas' master (which holds around 900 wikis) had hardware issues we are still trying to figure out (https://phabricator.wikimedia.org/T225391), which means those s3 wikis are currently not getting new data.
We are trying to sort things out. Sorry for the inconvenience.
Manuel
All shards are affected. Please see the heartbeat report at https://tools.wmflabs.org/replag/. Thanks.
We also have one host down for maintenance (https://phabricator.wikimedia.org/T222978), which puts more load on the other two.
We did a bit of live investigation with the help of the DBA team on a related IRC report a few days ago [0]. It turned out that the replication stalls were happening at that time because 100% of the CPU of the server providing the *.web.db.svc.eqiad.wmflabs databases was being used to handle active queries. The replication threads/subprocesses were waiting their turn to apply changes to the underlying tables.
One guess from me as to the underlying cause is the combination of the actor and comment table schema changes. This seems to be supported by the increasing frequency of lag starting 2019-06-02 [1]. There is no magic server side fix if this guess is correct. Instead, tools will need to change the queries that they are running to make more efficient use of the new tables and/or do searches against a smaller set of rows at a time in the wiki replicas.
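As an illustration of the second suggestion, a rough sketch of working through the logging table in bounded slices rather than filtering a user's whole history in one query (the column names follow the post-actor-migration MediaWiki schema, i.e. log_actor referencing actor_id; the user name and one-week window are made up):

  SELECT log_id, log_timestamp, actor_name
  FROM logging
  JOIN actor ON actor_id = log_actor
  WHERE actor_name = 'ExampleUser'           -- hypothetical user
    AND log_timestamp >= '20190601000000'    -- process one week of rows...
    AND log_timestamp <  '20190608000000';   -- ...then advance the window

The tool then repeats the query for the next window instead of asking the replica to scan the whole table at once.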
One of the reasons that actor and comment searches are slow is related to the way that we ensure that content which has been suppressed/rev deleted is removed from the wiki replicas. This is done in real-time as part of the views that are publicly exposed over the underlying tables. In the case of both the actor and comment views, the view is not aware of the type of entity that a given query is interested in. It is designed so that you could do a query like `select * from comment` and it would properly suppress any and all comments that should not be publicly viewable. To accomplish this, the views for actor and comment are burdened by subqueries against 8 other tables in order to ensure that only the correct rows are available.
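To make that concrete, here is a deliberately simplified sketch of the idea for a single referencing table (not the actual view definition used on the replicas):

  SELECT c.comment_id, c.comment_text
  FROM comment c
  WHERE EXISTS (
      SELECT 1
      FROM logging l
      WHERE l.log_comment_id = c.comment_id
        AND l.log_deleted & 2 = 0   -- bit 2 is the "comment deleted" flag
  );

The production views repeat a check like this for each of the ~8 tables that can reference a comment or actor, which is why an unqualified scan of those views is expensive.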
This multi-table subquery problem is being discussed in T215445 [2], but since that's work toward a long-term solution, the Cloud Services team has put up a set of sub-views of these tables to make things faster for individual queries. Each subview only makes a subquery against a single table. This way, if you are looking for an actor row from the logging table, then you can query against the actor_logging view. This view will be more efficient for that case. The actor_logging view will, naturally, not have any rows available that might be referenced instead in the revision table or the archive table.
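For example, a lookup of who performed some logging actions could use the narrower view like this (a sketch; column names follow the MediaWiki actor schema):

  SELECT l.log_id, l.log_timestamp, a.actor_name
  FROM logging l
  JOIN actor_logging a ON a.actor_id = l.log_actor
  WHERE l.log_type = 'block'
    AND l.log_timestamp >= '20190601000000';
  -- actor_logging only has to run its suppression subquery against
  -- logging, instead of every table the generic actor view must check.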
These new views have been documented on wikitech [3], but we haven't done a messaging push to advertise them to the wiki replica users yet. Hopefully a few folks will read about them here in this thread! The Cloud Services team will also make a cloud-announce posting about this soon.
[0]: https://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-cloud/20190607.txt (starting around 20:43)
[1]: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20p...
[2]: https://phabricator.wikimedia.org/T215445
[3]: https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_R...
Bryan