(TL;DR? Skip down three paragraphs to the possible workaround....) Last month, I reported on the progress of SHA-1 updates from the WMF servers, and noted that s1 replag was likely to continue to be a problem for a number of weeks. As I said then, the WMF was using (at least) three processes to populate the SHA-1 field on three separate blocks of revision records. All these changes were then being replicated to the Toolserver's copies of the databases, and this flood of updates was causing the replag.
The three blocks were being populated at different rates (for reasons that are beyond my knowledge). On July 23 at about 15:00 UTC, rosemary (sql-s1-rr) completed updating the first of the three blocks. The other blocks continued to be populated (and at some point the WMF started another process to help finish off the slowest block), but the rate of updates was somewhat less, and rosemary actually caught up on its backlog and reached zero replag within about a day after this milestone.
The situation on thyme (sql-s1-user) is less favorable, as we all know. The replag on that server got much higher to start with, and thyme didn't even reach the end of the first block until Sunday August 5 at about 12:00 UTC. Unlike the situation with rosemary, the reduced load after this event did not make any noticeable difference to the replag, which has continued to increase for the past three days at much the same rate as before. The next milestone will be completion of the second major block, which looks like it will occur either late on Thursday August 9 or early on Friday August 10 UTC, barring any other major problems (like the WMF server outage on Monday which caused replication at the TS end to stop for several hours). At that point, the load from SHA-1 updates should be roughly 30% of what it had been during July. One would think that would allow the replag to drop, but since the events of this week, I can't be confident of that.
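The dynamic described here, where lag keeps growing as long as incoming updates outpace the replica's apply capacity, can be sketched with a toy model. All numbers below are made up for illustration; this is not Toolserver code.

```python
# Toy model of replication lag: each wall-clock hour, some number of
# "hours' worth" of writes arrive and the replica works off some number
# of "hours' worth" of its backlog. Lag cannot go below zero.

def replag_after(hours, incoming_per_hour, apply_per_hour, initial_lag_hours=0.0):
    """Net replication lag (in hours) after `hours` of wall-clock time."""
    lag = initial_lag_hours
    for _ in range(hours):
        lag = max(0.0, lag + incoming_per_hour - apply_per_hour)
    return lag
```

With an effective incoming rate of 1.2 hours of writes per hour against an apply rate of 1.0, lag grows steadily; cut the update load to 30% of that and the sign flips, so the backlog drains instead of growing.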
There is a possible workaround. The TS could treat this like a server outage; copy user databases from thyme to rosemary and then point sql-s1-user to rosemary, which currently has no replag. Rosemary would then have to handle twice the load, but thyme should start to recover very quickly with no user-generated queries hitting it. Once thyme has recovered, point sql-s1-rr to it.
Downsides: (1) this would require several hours of downtime for sql-s1-user while the user databases are copied; all tools that require access to user databases would be offline entirely for this period. (2) it would have to wait until our volunteer TS admins have time to do it. (3) the added load on rosemary could cause replag to grow there, although I doubt it would come anywhere near the 14+ days replag we are dealing with now on thyme. (4) this could all be unnecessary since thyme might recover on its own once the SHA-1 update load is reduced, although I don't know any way of forecasting that and experience so far has not been encouraging.
Question for those of you who operate and/or use tools that access s1 (enwiki): would you be willing to accept several hours of service outage and the other downsides in exchange for getting rid of the 14-day replag?
I'm a little confused as to which DB server we are talking about. I need access to
enwiki-p.db.toolserver.org (hap-s1-user.esi.toolserver.org).
is that sql-s1-user or sql-s1-rr or what?
Daniel
On Wed, Aug 8, 2012 at 7:46 AM, Russell Blau russblau@imapmail.org wrote:
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/toolserver-l Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
On Wed, Aug 8, 2012, at 11:35 AM, Daniel Schwen wrote:
Daniel - there are two copies of enwiki-p; see [1] for details. If you need access to any database whose name starts with "u_" or "p_", you need the sql-s1-user copy. If you *don't* need access to those databases, you ought to be using the sql-s1-rr copy, and you are degrading the performance of your application if you don't.
The address "enwiki-p.db.toolserver.org" points to the sql-s1-user copy, and is deprecated; you ought to be using either "enwiki-p.rrdb.toolserver.org" or "enwiki-p.userdb.toolserver.org" instead.
[1] https://wiki.toolserver.org/view/Database_access
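The routing rule Russell describes can be summed up in a few lines. The hostnames are the ones given in the thread; the helper function itself is hypothetical, not a Toolserver API.

```python
# Sketch of the enwiki-p host-selection rule: use the user-database copy
# (sql-s1-user) only when u_*/p_* databases are needed; otherwise use the
# read-only replica (sql-s1-rr) to spare the loaded user server.

def s1_host(needs_user_dbs: bool) -> str:
    """Pick the appropriate enwiki-p replica hostname."""
    if needs_user_dbs:
        return "enwiki-p.userdb.toolserver.org"  # sql-s1-user (thyme)
    return "enwiki-p.rrdb.toolserver.org"        # sql-s1-rr (rosemary)
```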
On Wed, Aug 8, 2012 at 3:46 PM, Russell Blau russblau@imapmail.org wrote:
There is a possible workaround. The TS could treat this like a server outage; copy user databases from thyme to rosemary and then point sql-s1-user to rosemary, which currently has no replag. Rosemary would then have to handle twice the load, but thyme should start to recover very quickly with no user-generated queries hitting it. Once thyme has recovered, point sql-s1-rr to it.
Downsides: (1) this would require several hours of downtime for sql-s1-user while the user databases are copied; all tools that require access to user databases would be offline entirely for this period. (2) it would have to wait until our volunteer TS admins have time to do it.
Actually, it could probably be reduced from "downtime" to "read-only user databases". If thyme kept writing to the binlog, it could probably keep accepting writes for most of that period. This comes at the expense of TS admin time, of course.
(3) the added load on rosemary could cause replag to grow there, although I doubt it would come anywhere near the 14+ days replag we are dealing with now on thyme.
Depending on the insert speed without user queries, another option would be copying the db from rosemary to thyme. (I'm assuming that would be much slower than the downtime for moving the user dbs, but it's just a guess; if it weren't, this could replace that move.)
On Wed, Aug 8, 2012, at 11:59 AM, Platonides wrote:
Depending on the insert speed without user queries, another option would be copying the db from rosemary to thyme. (I'm assuming that would be much slower than the downtime for moving the user dbs, but it's just a guess; if it weren't, this could replace that move.)
Well, based on the overwhelming response to my last message, I guess nobody but me cares if thyme is lagged by three or four or five weeks....
Thyme finished processing the updates in the second block a few hours ago, but the replag is continuing to increase. This is very worrisome, and possibly there is something else going on there that the SHA-1 updates have been masking. All the TS admins seem to be on summer holiday; is there anyone around who has mysql root access and can look for problems on thyme?
Hello, At Friday 10 August 2012 16:34:07 DaB. wrote:
All the TS admins seem to be on summer holiday; is there anyone around who has mysql root access and can look for problems on thyme?
I killed a few very long-running queries on thyme and AFAIS the replag is slowly decreasing.
Sorry for the non-response on my side these last few days, but I was busy with non-TS-stuff (and Nosy has another very important thing to do at the moment :-))
Sincerely, DaB.
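The step DaB. describes, picking out "very long runners" before killing them, can be sketched against a processlist-style snapshot. The snapshot below is made-up data in the shape of MySQL's `SHOW PROCESSLIST` output (Id, User, Time in seconds, Info); the filter function is illustrative, not an admin tool that exists.

```python
# Made-up SHOW PROCESSLIST-style snapshot; Time is seconds the query has run.
processlist = [
    {"Id": 101, "User": "worker", "Time": 52,    "Info": "SELECT ..."},
    {"Id": 102, "User": "wp10",   "Time": 86400, "Info": "SELECT ... JOIN ..."},
    {"Id": 103, "User": "dpl",    "Time": 7200,  "Info": "SELECT ..."},
]

def long_runners(plist, threshold_seconds=3600):
    """Return the connection ids of queries running longer than the threshold."""
    return [p["Id"] for p in plist if p["Time"] > threshold_seconds]
```

An admin with mysql root would then issue `KILL <id>` for each returned id.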
On Fri, Aug 10, 2012 at 8:19 AM, Russell Blau russblau@imapmail.org wrote:
Well, based on the overwhelming response to my last message, I guess nobody but me cares if thyme is lagged by three or four or five weeks....
I found your post very helpful as a status update on the current situation. The lag has a huge effect on the WP 1.0 bot that is used to track article assessments on enwiki, and which has a large user database as well.
Thyme finished processing the updates in the second block a few hours ago, but the replag is continuing to increase. This is very worrisome, and possibly there is something else going on there that the SHA-1 updates have been masking. All the TS admins seem to be on summer holiday; is there anyone around who has mysql root access and can look for problems on thyme?
Just to see if it makes any difference I killed the running WP 1.0 process on thyme. Right now the replag seems to be decreasing at a tiny rate, less than 10 minutes per hour. There are 411 hours of replag.
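A quick back-of-the-envelope calculation shows why that recovery rate is so discouraging: at 10 minutes of lag worked off per wall-clock hour, 411 hours of backlog would take months to drain.

```python
# 411 hours of replag, recovering at roughly 10 minutes of lag per
# wall-clock hour (the figures reported above).

lag_minutes = 411 * 60        # current backlog, in minutes
recovery_per_hour = 10        # minutes of lag worked off per real hour

hours_to_catch_up = lag_minutes / recovery_per_hour
print(hours_to_catch_up / 24)  # → 102.75, i.e. over three months at that rate
```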
For what it's worth, I would personally prefer a short complete outage (or make the server read-only) if that would leave us with no replag, rather than waiting for weeks or months while the replag slowly decreases.
- Carl
On Fri, Aug 10, 2012, at 03:42 PM, Carl (CBM) wrote:
Just to see if it makes any difference I killed the running WP 1.0 process on thyme. Right now the replag seems to be decreasing at a tiny rate, less than 10 minutes per hour. There are 411 hours of replag.
Carl, that is a good idea and I've stopped all the scheduled dpl project jobs that usually run on thyme, for 24 hours, to see if that helps. If anyone else could temporarily shut down their tools to help reduce the load on the server, maybe we can all help improve the recovery rate.
On Fri, Aug 10, 2012 at 3:42 PM, Carl (CBM) cbm.wikipedia@gmail.com wrote:
On Fri, Aug 10, 2012 at 8:19 AM, Russell Blau russblau@imapmail.org wrote:
Well, based on the overwhelming response to my last message, I guess nobody but me cares if thyme is lagged by three or four or five weeks....
I found your post very helpful as a status update on the current situation. The lag has a huge effect on the WP 1.0 bot that is used to track article assessments on enwiki, and which has a large user database as well.
+1