[Labs-l] A tale of three databases

Nuria Ruiz nuria at wikimedia.org
Tue Sep 23 16:27:47 UTC 2014


Seconded; getting more visibility into how labs is set up is very
educational.

On Tue, Sep 23, 2014 at 9:14 AM, Denny Vrandečić <vrandecic at gmail.com>
wrote:

> Thank you for the postmortem! Postmortems like this often contain very
> valuable lessons, and I am glad you chose to write it down and share it
> so openly. This deserves kudos!
>
> On Tue, Sep 23, 2014 at 6:46 AM, Marc A. Pelletier <marc at uberbox.org>
> wrote:
>
>> [Or; an outage report in three acts]
>>
>> So, what happened over the last couple of days that caused so many
>> small issues with the replica databases?  To make that clear, I'll
>> first explain a bit about how the replicas are structured.
>>
>> At the dawn of time, the production replicas were set up as a
>> small-scale copy of production itself, with the various project DBs
>> split into seven "slices" to spread load.  Those seven slices ran on
>> three (physical) servers, and each held a replica of its production
>> equivalent.  (This is what everyone saw as "s1" - "s7".)
>>
>> Now, to ease the transition from the Toolserver, I set up a set of
>> ugly networking rules[1] so that tools that don't understand that
>> more than one database can live on the same physical server could
>> work without adaptation.  The rules made those three servers appear
>> to be seven different ones, letting code pretend that just changing
>> the address gets you to a different server.
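>>
>> To give a flavour of what those rules did, here is a minimal sketch
>> with made-up addresses (the real IPs, ports and rule set were of
>> course different):
>>
>>   #!/usr/bin/env python
>>   # Hypothetical illustration: each slice gets a faux IP of its own,
>>   # DNATed to the real server and to the port its mysqld listens on.
>>   SLICES = {
>>       # faux "server" IP: (real server IP, port)
>>       "192.0.2.1": ("10.0.0.1", 3306),  # s1
>>       "192.0.2.2": ("10.0.0.1", 3307),  # s2
>>       "192.0.2.3": ("10.0.0.2", 3306),  # s3
>>       "192.0.2.4": ("10.0.0.2", 3307),  # s4
>>       "192.0.2.5": ("10.0.0.3", 3306),  # s5
>>       "192.0.2.6": ("10.0.0.3", 3307),  # s6
>>       "192.0.2.7": ("10.0.0.3", 3308),  # s7
>>   }
>>   for faux, (real, port) in sorted(SLICES.items()):
>>       print("iptables -t nat -A PREROUTING -p tcp -d %s --dport 3306 "
>>             "-j DNAT --to-destination %s:%s" % (faux, real, port))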
>>
>> Enter MariaDB.
>>
>> Now, MariaDB is a very nice improvement for everyone: not only does
>> it allow us to mirror /every/ slice on all three servers (allowing
>> easy joins between databases), but thanks to a new database engine
>> (TokuDB) it does so faster and more reliably than vanilla MySQL
>> could.  This meant that we no longer needed to run seven MySQL
>> instances, but just one per server, each holding a copy of every
>> production database.
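>>
>> Concretely, a tool can now do in a single query what used to need
>> two connections.  A sketch of the kind of thing that becomes possible
>> (the host alias is hypothetical; enwiki_p and dewiki_p stand in for
>> the replica databases):
>>
>>   #!/usr/bin/env python
>>   import pymysql
>>
>>   # One connection reaches every replicated database on the server,
>>   # so a cross-wiki join needs no application-side plumbing.
>>   conn = pymysql.connect(host="enwiki.labsdb",  # hypothetical alias
>>                          read_default_file="~/.my.cnf")
>>   with conn.cursor() as cur:
>>       # Titles that exist as articles on both the English and the
>>       # German Wikipedia.
>>       cur.execute("""
>>           SELECT e.page_title
>>             FROM enwiki_p.page AS e
>>             JOIN dewiki_p.page AS d ON d.page_title = e.page_title
>>            WHERE e.page_namespace = 0 AND d.page_namespace = 0
>>            LIMIT 5""")
>>       for (title,) in cur.fetchall():
>>           print(title)
>>   conn.close()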
>>
>> So Sean (our DBA) set about converting our setup to MariaDB and
>> merging the databases that used to live on every server into a
>> single instance.  This worked well, with only minor problems caused
>> by slight behaviour differences between MySQL and MariaDB, or
>> between InnoDB (the previous database engine) and TokuDB.  Two of
>> the servers were converted that way, with the third to follow once
>> the kinks were worked out[2].
>>
>> Fast forward several weeks, and a second, unrelated issue was on the
>> plate to fix.  You see, of the three database servers, one had been
>> set up in the wrong place in the datacenter[3]; it worked, but
>> because of where it sat it kept needing special exceptions in the
>> firewall rules, which was not only a maintenance burden but also
>> error prone and less secure.
>>
>> Fixing /that/ would be a simple thing: it only needed a short
>> downtime while someone physically hauled the hardware from one place
>> in the datacenter to another and changed its IP address.
>>
>> That went well, and in less than an hour the database was sitting
>> happily in its new rack with its new IP address.
>>
>> Now, at that point the networking configuration needed to be changed
>> anyway, and since the databases had been merged[4], this was
>> obviously the right time to rip out the ugly networking rules that
>> had become no-ops and by now just added a layer of needless
>> complexity.
>>
>> That also went well, except for one niggling detail[5]: the
>> databases on the third server never /did/ get merged like on the
>> other two.  Removing the networking rules had no effect on the first
>> two servers (as expected), but now only the first of the three
>> database instances on the third server was reachable.
>>
>> Worse: the other two databases *looked* like they were still happily
>> working, since you could apparently still connect to them; you just
>> ended up connected to the wrong one.
>>
>> So the change was made, accompanied by some tests, and all seemed
>> fine because, of the dozen or so project databases I tested, none
>> happened to live on the two of the seven slices that were no longer
>> visible.
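>>
>> In hindsight, a quick sanity check against /every/ slice alias,
>> rather than a sample of project databases, would have caught it.
>> Something along these lines (aliases and slice assignments here are
>> illustrative, not the real ones):
>>
>>   #!/usr/bin/env python
>>   import pymysql
>>
>>   # One database we expect to be visible through each slice alias.
>>   EXPECTED = {
>>       "s1.labsdb": "enwiki_p",
>>       "s2.labsdb": "dewiki_p",
>>       "s3.labsdb": "frwiki_p",
>>       "s4.labsdb": "commonswiki_p",
>>       "s5.labsdb": "wikidatawiki_p",
>>       "s6.labsdb": "ruwiki_p",
>>       "s7.labsdb": "metawiki_p",
>>   }
>>   for host, db in sorted(EXPECTED.items()):
>>       conn = pymysql.connect(host=host, read_default_file="~/.my.cnf")
>>       with conn.cursor() as cur:
>>           cur.execute("SELECT @@hostname")
>>           print("%s -> %s" % (host, cur.fetchone()[0]))
>>           cur.execute("SELECT schema_name"
>>                       " FROM information_schema.schemata"
>>                       " WHERE schema_name = %s", (db,))
>>           assert cur.fetchone(), "%s not visible via %s" % (db, host)
>>       conn.close()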
>>
>> Monday comes, panic ensues.  In the end, we decided that the fix was
>> to merge the databases on the third server as well (that took around
>> a day), and we're now back to fully working status.
>>
>> Like all good tales, this one has a moral[6]: no change is so
>> obvious that it doesn't require careful planning.  The disruption
>> over the weekend happened only because I didn't take the time to
>> double-check my assumptions, since the change was "trivial".
>>
>> Or, as I learned while wiping the egg off my face, it would have
>> *been* trivial had my assumptions matched reality.
>>
>> Exit sysadmin, stage left, head hung low in shame at his exposed
>> hubris.
>>
>> -- Marc
>>
>> [1] The "iptable rules" you may have heard mentionned on occasions.
>> Basically, just a set of NAT rules to redirect faux IPs standing in for
>> the servers to the right IP and port.
>>
>> [2] Pay attention here, that's some skillful foreshadowing right there.
>>
>> [3] Moved from one row of eqiad to another, for those keeping score.
>>
>> [4] If you've been following at home, you already see where this is
>> heading.
>>
>> [5] Also, the change was done on a Friday.  "But it's just a trivial
>> change!"
>>
>> [6] Well, two morals if you count the "Don't do a change before you
>> leave for the weekend!" beating I also gave myself.