(TL;DR? Skip down three paragraphs to the possible workaround....) Last month, I reported on the progress of SHA-1 updates from the WMF servers, and noted that s1 replag was likely to continue to be a problem for a number of weeks. As I said then, the WMF was using (at least) three processes to populate the SHA-1 field on three separate blocks of revision records. All these changes were then being replicated to the Toolserver's copies of the databases, and this flood of updates was causing the replag.
The three blocks were being populated at different rates (for reasons that are beyond my knowledge). On July 23 at about 15:00 UTC, rosemary (sql-s1-rr) completed updating the first of the three blocks. The other blocks continued to be populated (and at some point the WMF started another process to help finish off the slowest block), but the rate of updates was somewhat less, and rosemary actually caught up on its backlog and reached zero replag within about a day after this milestone.
The situation on thyme (sql-s1-user) is less favorable, as we all know. The replag on that server got much higher to start with, and thyme didn't even reach the end of the first block until Sunday August 5 at about 12:00 UTC. Unlike the situation with rosemary, the reduced load after this event did not make any noticeable difference to the replag, which has continued to increase for the past three days at much the same rate as before. The next milestone will be completion of the second major block, which looks like it will occur either late on Thursday August 9 or early on Friday August 10 UTC, barring any other major problems (like the WMF server outage on Monday which caused replication at the TS end to stop for several hours). At that point, the load from SHA-1 updates should be roughly 30% of what it had been during July. One would think that would allow the replag to drop, but since the events of this week, I can't be confident of that.
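The dynamic described here, where lag keeps growing as long as incoming updates outpace the replica's apply capacity, can be sketched with a toy model. All numbers below are made up for illustration; this is not Toolserver code.

```python
# Toy model of replication lag: each wall-clock hour, some number of
# "hours' worth" of writes arrive and the replica works off some number
# of "hours' worth" of its backlog. Lag cannot go below zero.

def replag_after(hours, incoming_per_hour, apply_per_hour, initial_lag_hours=0.0):
    """Net replication lag (in hours) after `hours` of wall-clock time."""
    lag = initial_lag_hours
    for _ in range(hours):
        lag = max(0.0, lag + incoming_per_hour - apply_per_hour)
    return lag
```

With an effective incoming rate of 1.2 hours of writes per hour against an apply rate of 1.0, lag grows steadily; cut the update load to 30% of that and the sign flips, so the backlog drains instead of growing.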
There is a possible workaround. The TS could treat this like a server outage; copy user databases from thyme to rosemary and then point sql-s1-user to rosemary, which currently has no replag. Rosemary would then have to handle twice the load, but thyme should start to recover very quickly with no user-generated queries hitting it. Once thyme has recovered, point sql-s1-rr to it.
Downsides: (1) this would require several hours of downtime for sql-s1-user while the user databases are copied; all tools that require access to user databases would be offline entirely for this period. (2) it would have to wait until our volunteer TS admins have time to do it. (3) the added load on rosemary could cause replag to grow there, although I doubt it would come anywhere near the 14+ days replag we are dealing with now on thyme. (4) this could all be unnecessary since thyme might recover on its own once the SHA-1 update load is reduced, although I don't know any way of forecasting that and experience so far has not been encouraging.
Question for those of you who operate and/or use tools that access s1 (enwiki): would you be willing to accept several hours of service outage and the other downsides in exchange for getting rid of the 14-day replag?
I'm a little confused as to which DB server we are talking about. I need access to
enwiki-p.db.toolserver.org (hap-s1-user.esi.toolserver.org).
is that sql-s1-user or sql-s1-rr or what?
Daniel
On Wed, Aug 8, 2012 at 7:46 AM, Russell Blau russblau@imapmail.org wrote:
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/toolserver-l Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
On Wed, Aug 8, 2012, at 11:35 AM, Daniel Schwen wrote:
Daniel - there are two copies of enwiki-p; see [1] for details. If you need access to any database whose name starts with "u_" or "p_", you need the sql-s1-user copy. If you *don't* need access to those databases, you ought to be using the sql-s1-rr copy, and you are degrading the performance of your application if you don't.
The address "enwiki-p.db.toolserver.org" points to the sql-s1-user copy, and is deprecated; you ought to be using either "enwiki-p.rrdb.toolserver.org" or "enwiki-p.userdb.toolserver.org" instead.
[1] https://wiki.toolserver.org/view/Database_access
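The routing rule Russell describes can be summed up in a few lines. The hostnames are the ones given in the thread; the helper function itself is hypothetical, not a Toolserver API.

```python
# Sketch of the enwiki-p host-selection rule: use the user-database copy
# (sql-s1-user) only when u_*/p_* databases are needed; otherwise use the
# read-only replica (sql-s1-rr) to spare the loaded user server.

def s1_host(needs_user_dbs: bool) -> str:
    """Pick the appropriate enwiki-p replica hostname."""
    if needs_user_dbs:
        return "enwiki-p.userdb.toolserver.org"  # sql-s1-user (thyme)
    return "enwiki-p.rrdb.toolserver.org"        # sql-s1-rr (rosemary)
```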
On Wed, Aug 8, 2012 at 3:46 PM, Russell Blau russblau@imapmail.org wrote:
There is a possible workaround. The TS could treat this like a server outage; copy user databases from thyme to rosemary and then point sql-s1-user to rosemary, which currently has no replag. Rosemary would then have to handle twice the load, but thyme should start to recover very quickly with no user-generated queries hitting it. Once thyme has recovered, point sql-s1-rr to it.
Downsides: (1) this would require several hours of downtime for sql-s1-user while the user databases are copied; all tools that require access to user databases would be offline entirely for this period. (2) it would have to wait until our volunteer TS admins have time to do it.
Actually, it could probably be reduced from "downtime" to "read-only user databases". If thyme kept writing to the binlog, it could probably keep accepting writes for most of that period. This comes at the expense of TS admin time, of course.
(3) the added load on rosemary could cause replag to grow there, although I doubt it would come anywhere near the 14+ days replag we are dealing with now on thyme.
Depending on the insert speed without user queries, another option would be copying the db from rosemary to thyme. (I'm assuming that would be much slower than the downtime for moving the user dbs, but it's just a guess; if it weren't, this could replace that move.)
On Wed, Aug 8, 2012, at 11:59 AM, Platonides wrote:
Depending on the insert speed without user queries, another option would be copying the db from rosemary to thyme. (I'm assuming that would be much slower than the downtime for moving the user dbs, but it's just a guess; if it weren't, this could replace that move.)
Well, based on the overwhelming response to my last message, I guess nobody but me cares if thyme is lagged by three or four or five weeks....
Thyme finished processing the updates in the second block a few hours ago, but the replag is continuing to increase. This is very worrisome, and possibly there is something else going on there that the SHA-1 updates have been masking. All the TS admins seem to be on summer holiday; is there anyone around who has mysql root access and can look for problems on thyme?
Hello, At Friday 10 August 2012 16:34:07 DaB. wrote:
All the TS admins seem to be on summer holiday; is there anyone around who has mysql root access and can look for problems on thyme?
I killed a few very long-running queries on thyme and AFAIS the replag is slowly decreasing.
Sorry for the non-response on my side these last few days, but I was busy with non-TS-stuff (and Nosy has another very important thing to do at the moment :-))
Sincerely, DaB.
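The step DaB. describes, picking out "very long runners" before killing them, can be sketched against a processlist-style snapshot. The snapshot below is made-up data in the shape of MySQL's `SHOW PROCESSLIST` output (Id, User, Time in seconds, Info); the filter function is illustrative, not an admin tool that exists.

```python
# Made-up SHOW PROCESSLIST-style snapshot; Time is seconds the query has run.
processlist = [
    {"Id": 101, "User": "worker", "Time": 52,    "Info": "SELECT ..."},
    {"Id": 102, "User": "wp10",   "Time": 86400, "Info": "SELECT ... JOIN ..."},
    {"Id": 103, "User": "dpl",    "Time": 7200,  "Info": "SELECT ..."},
]

def long_runners(plist, threshold_seconds=3600):
    """Return the connection ids of queries running longer than the threshold."""
    return [p["Id"] for p in plist if p["Time"] > threshold_seconds]
```

An admin with mysql root would then issue `KILL <id>` for each returned id.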
On Fri, Aug 10, 2012 at 8:19 AM, Russell Blau russblau@imapmail.org wrote:
Well, based on the overwhelming response to my last message, I guess nobody but me cares if thyme is lagged by three or four or five weeks....
I found your post very helpful as a status update on the current situation. The lag has a huge effect on the WP 1.0 bot that is used to track article assessments on enwiki, and which has a large user database as well.
Thyme finished processing the updates in the second block a few hours ago, but the replag is continuing to increase. This is very worrisome, and possibly there is something else going on there that the SHA-1 updates have been masking. All the TS admins seem to be on summer holiday; is there anyone around who has mysql root access and can look for problems on thyme?
Just to see if it makes any difference I killed the running WP 1.0 process on thyme. Right now the replag seems to be decreasing at a tiny rate, less than 10 minutes per hour. There are 411 hours of replag.
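A quick back-of-the-envelope calculation shows why that recovery rate is so discouraging: at 10 minutes of lag worked off per wall-clock hour, 411 hours of backlog would take months to drain.

```python
# 411 hours of replag, recovering at roughly 10 minutes of lag per
# wall-clock hour (the figures reported above).

lag_minutes = 411 * 60        # current backlog, in minutes
recovery_per_hour = 10        # minutes of lag worked off per real hour

hours_to_catch_up = lag_minutes / recovery_per_hour
print(hours_to_catch_up / 24)  # → 102.75, i.e. over three months at that rate
```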
For what it's worth, I would personally prefer a short complete outage (or make the server read-only) if that would leave us with no replag, rather than waiting for weeks or months while the replag slowly decreases.
- Carl
On Fri, Aug 10, 2012, at 03:42 PM, Carl (CBM) wrote:
Just to see if it makes any difference I killed the running WP 1.0 process on thyme. Right now the replag seems to be decreasing at a tiny rate, less than 10 minutes per hour. There are 411 hours of replag.
Carl, that is a good idea and I've stopped all the scheduled dpl project jobs that usually run on thyme, for 24 hours, to see if that helps. If anyone else could temporarily shut down their tools to help reduce the load on the server, maybe we can all help improve the recovery rate.
On Fri, Aug 10, 2012 at 3:42 PM, Carl (CBM) cbm.wikipedia@gmail.com wrote:
On Fri, Aug 10, 2012 at 8:19 AM, Russell Blau russblau@imapmail.org wrote:
Well, based on the overwhelming response to my last message, I guess nobody but me cares if thyme is lagged by three or four or five weeks....
I found your post very helpful as a status update on the current situation. The lag has a huge effect on the WP 1.0 bot that is used to track article assessments on enwiki, and which has a large user database as well.
+1