<div dir="ltr">Second that, getting more visibility in how labs is set up is very educational.</div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Sep 23, 2014 at 9:14 AM, Denny Vrandečić <span dir="ltr"><<a href="mailto:vrandecic@gmail.com" target="_blank">vrandecic@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Thank you for the postmortem! Such often contain very valuable lessons, and I am glad you chose to write it down and share it so openly. This deserves kudos!<div><br></div><div><br></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Sep 23, 2014 at 6:46 AM, Marc A. Pelletier <span dir="ltr"><<a href="mailto:marc@uberbox.org" target="_blank">marc@uberbox.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">[Or; an outage report in three acts]<br>
<br>
So, what happened over the last couple of days that caused so many<br>
small issues with the replica databases? In order to make that clear,<br>
I'll explain a bit how the replicas are structured.<br>
<br>
At the dawn of time, the production replicas were set up as a<br>
small-scale copy of how production itself is set up, with the various<br>
project DBs split in seven "slices" to spread load. Those seven slices<br>
ran on three (physical) servers, and each held a replica of its<br>
production equivalent. (This is what everyone saw as "s1" - "s7").<br>
<br>
Now, in order to allow tools that don't understand that more than one<br>
database can live on the same physical server to work without needing<br>
adaptation (to ease transition from the toolserver), I set up a set of<br>
ugly networking rules[1] that made those three servers appear to be<br>
seven different ones - allowing code to pretend that just changing the<br>
address gets you to a different server.<br>
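<br>
As a rough illustration of that redirection, here is a minimal Python sketch<br>
that models the rules as a lookup table. Every address and port in it is<br>
hypothetical (documentation ranges), as is the way the slices are grouped<br>
per server; none of those values come from the actual configuration.<br>
<pre>
```python
# Hypothetical values throughout: the real addresses, ports and the grouping
# of slices per server are assumptions made for this illustration only.
DEFAULT_PORT = 3306

# Alias address handed to tools for each slice, mapped to the physical
# server and the port of the database instance holding that slice.
NAT_RULES = {
    "192.0.2.1": ("198.51.100.1", 3306),  # s1
    "192.0.2.2": ("198.51.100.1", 3307),  # s2
    "192.0.2.3": ("198.51.100.2", 3306),  # s3
    "192.0.2.4": ("198.51.100.2", 3307),  # s4
    "192.0.2.5": ("198.51.100.3", 3306),  # s5
    "192.0.2.6": ("198.51.100.3", 3307),  # s6
    "192.0.2.7": ("198.51.100.3", 3308),  # s7
}


def resolve(alias_ip, port=DEFAULT_PORT):
    """Return the (server, port) pair a client really reaches.

    A connection to an alias address on the default port is rewritten to the
    matching instance; anything else passes through unchanged.
    """
    if port == DEFAULT_PORT and alias_ip in NAT_RULES:
        return NAT_RULES[alias_ip]
    return (alias_ip, port)


# Seven apparent servers, three physical ones.
physical_servers = {server for server, _ in NAT_RULES.values()}
```
</pre>
With this table, resolve("192.0.2.7") lands on the third physical server at<br>
its own port, even though the tool only ever changed the address.<br>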
<br>
Enter MariaDB.<br>
<br>
Now, MariaDB is a very nice improvement for everyone: not only does it<br>
allow us to mirror /every/ slice on all three servers (allowing easy<br>
joins between databases), but it does so faster and more reliably than<br>
vanilla mysql could thanks to a new database engine (TokuDB). What this<br>
meant is that we no longer needed to run seven mysql instances but just<br>
one per server, each having a copy of every production database.<br>
<br>
So, Sean (our DBA) set about converting our setup to MariaDB and<br>
merging the databases that used to live on every server to a single one.<br>
This worked well, with only minor problems caused by some slight<br>
behaviour differences between mysql and mariadb or between innodb (the<br>
previous database engine) and tokudb. Two of the servers were completed<br>
that way with the third soon to be done once the kinks were worked out[2].<br>
<br>
Fast forward several weeks and a second, unrelated issue was on the<br>
plate to fix. You see, of the three database servers one had been set<br>
up in the wrong place in the datacenter[3]; it worked, but because it<br>
was there it kept needing special exceptions in the firewall rules, which<br>
was not only a maintenance issue, but also error prone and less secure.<br>
<br>
Fixing /that/ would be a simple thing; it only needs a short downtime<br>
while someone actually physically hauls the hardware from one place in<br>
the datacenter to another and changes its IP address.<br>
<br>
That went well, and in less than an hour the database was sitting<br>
happily in its new rack with its new IP address.<br>
<br>
Now, at that point, the networking configuration needed to be changed<br>
anyway, and since the databases had been merged[4], it was obvious that<br>
this was the right time to rip out the ugly networking rules that had<br>
become noops and by now just added a layer of needless complexity.<br>
<br>
That also went well, except for one niggling detail[5]: the databases on<br>
the third server never /did/ get merged like the other two. Removing<br>
the networking rules had no effect on the first two (as expected) but<br>
now only the first of three databases on the third was accessible.<br>
<br>
Worse: it *looks* like the other two databases are still happily working<br>
since you apparently can still connect to them (but end up connected to<br>
the wrong one).<br>
<br>
So, the change is made accompanied by some tests and all seems fine,<br>
because, out of the dozen or so project databases I tested, I didn't<br>
happen to test connecting to a database that used to be on the two out<br>
of seven slices that are no longer visible.<br>
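<br>
For what it's worth, the kind of check that would have caught this can be<br>
sketched in a few lines of Python: test one database per slice, and confirm<br>
it really exists on whatever server answers, rather than only confirming<br>
that a connection succeeds. The slice and database names below are<br>
placeholders, and list_databases is a hypothetical callable standing in for<br>
a real query against the replicas; this is not the test that was actually run.<br>
<pre>
```python
def check_slices(expected, list_databases):
    """Return the slices whose expected database is missing.

    expected: dict mapping a slice name to one database that should be
        served by that slice.
    list_databases: callable taking a slice name and returning the database
        names visible on whichever server that name actually reaches.
    """
    failures = []
    for slice_name, database in sorted(expected.items()):
        if database not in list_databases(slice_name):
            failures.append(slice_name)
    return failures


# Simulated situation with placeholder names: all three slices still accept
# connections, but two of them land on the first slice's instance.
reached = {
    "sX": {"db_a"},
    "sY": {"db_a"},  # should expose db_b
    "sZ": {"db_a"},  # should expose db_c
}
expected = {"sX": "db_a", "sY": "db_b", "sZ": "db_c"}

broken = check_slices(expected, lambda name: reached[name])
print(broken)
```
</pre>
The point of iterating over slices rather than over a handful of projects<br>
is that coverage no longer depends on which databases one happens to pick.<br>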
<br>
Monday comes, panic ensues. In the end, we decided to merge the<br>
databases on the third server as the fix (that took around a day), and<br>
we're back to working status with everything done.<br>
<br>
Like all good tales, this one has a moral[6]. No change is so obvious<br>
that it doesn't require careful planning. The disruption over the<br>
weekend was due only to the fact that I didn't take the time to double<br>
check my assumptions because the change was "trivial".<br>
<br>
Or, as I learned while wiping the egg from my face, would have *been*<br>
trivial if my assumptions matched reality.<br>
<br>
Exit sysadmin stage left, head hung low in shame at his hubris exposed.<br>
<br>
-- Marc<br>
<br>
[1] The "iptable rules" you may have heard mentioned on occasion.<br>
Basically, just a set of NAT rules to redirect faux IPs standing in for<br>
the servers to the right IP and port.<br>
<br>
[2] Pay attention here, that's some skillful foreshadowing right there.<br>
<br>
[3] Moved from one row of eqiad to another, for those keeping score.<br>
<br>
[4] If you've been following at home, you already see where this is heading.<br>
<br>
[5] Also, the change was done on a Friday. "But it's just a trivial<br>
change!"<br>
<br>
[6] Well, two morals if you count the "Don't do a change before you<br>
leave for the weekend!" beating I also gave myself.<br>
<br>
<br>
_______________________________________________<br>
Labs-l mailing list<br>
<a href="mailto:Labs-l@lists.wikimedia.org" target="_blank">Labs-l@lists.wikimedia.org</a><br>
<a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" target="_blank">https://lists.wikimedia.org/mailman/listinfo/labs-l</a><br>
</blockquote></div><br></div>
</div></div>
<br></blockquote></div><br></div>