External storage consistency error - Wikitech-l

2 Dec 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
A problem was reported to us today which I have tracked down to a
configuration consistency problem.
First some background:
We use "external storage" servers to store the bulk of page text
contents for Wikimedia's wikis. These are web servers where we make use
of otherwise unused disk space by sticking on a copy of MySQL and
storing compressed text blobs.
For reliability, they come in clusters, using MySQL's replication. So if
one dies, we can shuffle them about with no loss of data.
New pages get saved into the most recent one or two ES clusters, and
every once in a while we close an old one off and add a new one on the end.
The current write clusters are cluster8 and cluster9. The DB master for
cluster8 was srv89, which had recently had some problems: it was
apparently overloaded, and didn't respond to login attempts.
On November 28, we had it rebooted, but it didn't come back online.
Cluster8 was reshuffled, with the master moved to srv87.
At about 22:30 UTC, srv89 came back online. Unfortunately this meant
that its web server came online as well, and apparently made it back
into the load balancing pool -- but with old copies of the MediaWiki
configuration.
srv89's MediaWiki still had the configuration files telling it to save
text blobs into the external storage server on... srv89, itself.
So, for an hour or so, some portion of page saves had their text blobs
written to srv89 instead of srv87. When other web servers then came to
read the pages back, they read from srv87 or its clone srv88, and
received a _different page_ with the _same blob ID number_.
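
To illustrate how two unrelated saves can end up with the same blob ID, here
is a toy sketch. It assumes each database hands out the next sequential ID on
its own; the BlobStore class and its contents are hypothetical stand-ins, not
the real external storage code.

```python
class BlobStore:
    """Toy stand-in for one external storage database (hypothetical)."""

    def __init__(self, blobs=None):
        self.blobs = dict(blobs or {})

    def save(self, text):
        # Assumption: IDs are handed out sequentially, per database.
        blob_id = max(self.blobs, default=0) + 1
        self.blobs[blob_id] = text
        return blob_id


# Both stores start from the same replicated state.
replicated = {1: "old page A", 2: "old page B"}
new_master = BlobStore(replicated)    # plays the role of srv87
stale_master = BlobStore(replicated)  # plays the role of srv89

# One save goes through an up-to-date web server, one through the stale one.
id_good = new_master.save("edit written to the current master")
id_stray = stale_master.save("edit written to the old master")

# Same ID, different content: readers going to the current master
# get the wrong text for the revision saved on the stale one.
collision = (id_good == id_stray
             and new_master.blobs[id_good] != stale_master.blobs[id_stray])
```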
The good news is that the data is probably recoverable, using this procedure:
* For each wiki, list blobs which differ between srv87 and srv89
* For each mismatched blob, locate the two (or more) revisions using it
* Disambiguate them somehow... (probably mostly automatable... I hope!)
* Copy the srv89 blobs onto srv87, and reassign the revisions to the
appropriate new blob ID numbers
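
A rough sketch of the first and last steps, assuming each server's blobs have
already been dumped into a dict of blob ID to bytes. The function names are
hypothetical, and the "disambiguate" step in the middle is deliberately left
out since it needs revision metadata this sketch doesn't model.

```python
import hashlib


def digest(blob: bytes) -> str:
    """Content fingerprint used to compare blobs cheaply."""
    return hashlib.sha1(blob).hexdigest()


def mismatched_blob_ids(primary, stray):
    """Return sorted IDs present in both stores whose content differs."""
    shared_ids = primary.keys() & stray.keys()
    return sorted(
        bid for bid in shared_ids
        if digest(primary[bid]) != digest(stray[bid])
    )


def copy_under_new_ids(primary, stray, conflicts):
    """Copy conflicting stray blobs into primary under fresh IDs.

    Returns a map of old (colliding) ID -> new ID, which is what the
    affected revisions would need to be repointed with.
    """
    next_id = max(primary, default=0) + 1
    remap = {}
    for bid in conflicts:
        primary[next_id] = stray[bid]
        remap[bid] = next_id
        next_id += 1
    return remap


# Tiny made-up example: blob 3 was written independently on both servers.
primary = {1: b"a", 2: b"b", 3: b"page saved on the current master"}
stray = {1: b"a", 2: b"b", 3: b"page saved on the stale master"}

conflicts = mismatched_blob_ids(primary, stray)
remap = copy_under_new_ids(primary, stray, conflicts)
```

Nothing in the authoritative store is overwritten; the stray blobs are only
appended, so the step can be re-checked before any revision is repointed.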
I'm not sure how many pages are affected, but it appears to be 25 edits
on fr.wikipedia.org, so perhaps a few hundred overall.
This could most likely have been avoided by forcing web servers to sync
their MediaWiki configurations before starting Apache on boot.
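
One way to picture that boot-time guard, as a sketch only: the sync command
name below is hypothetical, and the real boot scripts may be shaped quite
differently. The idea is just that the web server is never started unless the
config sync exits cleanly.

```python
import subprocess


def start_if_synced(sync_cmd, start_cmd, run=subprocess.run):
    """Run sync_cmd; start the web server only if the sync exits 0.

    `run` is injectable so the logic can be exercised without touching
    a real machine.
    """
    result = run(sync_cmd)
    if result.returncode != 0:
        # Stale config is worse than no web server: stay out of the pool.
        return False
    run(start_cmd)
    return True


# Hypothetical usage from a boot script:
#   start_if_synced(["sync-mediawiki-config"], ["apachectl", "start"])
```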
- -- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFcMQcwRnhpk1wk44RAgp0AKDgCNa8JWjmGmK0QhSuloQMYLPjdgCglhf6
uPsFhWJFxhXuvOFUdLzlgPE=
=HhJ5
-----END PGP SIGNATURE-----