-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
A problem was reported to us today which I have tracked down to a configuration consistency problem.
First some background:
We use "external storage" servers to store the bulk of page text contents for Wikimedia's wikis. These are web servers where we make use of otherwise unused disk space by sticking on a copy of MySQL and storing compressed text blobs.
For reliability, they come in clusters, using MySQL's replication. So if one dies, we can shuffle them about with no loss of data.
New pages get saved into the most recent one or two ES clusters, and every once in a while we close an old one off and add a new one on the end.
The current write clusters are cluster8 and cluster9. The DB master for cluster8 was srv89, which had recently had some problems: it was apparently overloaded, and didn't respond to login attempts.
On November 28, we had it rebooted, but it didn't come back online. Cluster8 was reshuffled, with the master moved to srv87.
At about 22:30 UTC, srv89 came back online. Unfortunately this meant that its web server came online as well, and apparently made it back into the load balancing pool -- but with old copies of the MediaWiki configuration.
srv89's MediaWiki still had the configuration files telling it to save text blobs into the external storage server on... srv89, itself.
So, some portion of pages saved during an hour or so had their items saved into srv89 instead of srv87. When other web servers then came to read the pages back, they read from srv87 or its clone srv88, and received a _different page_ with the _same blob ID number_.
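To illustrate how two different pages can end up behind the same blob ID, here is a minimal Python sketch. It assumes blob IDs come from a simple per-server auto-increment counter and that the stale server's counter had not advanced while it was down; the `BlobStore` class and the ID values are hypothetical stand-ins, not MediaWiki code.

```python
class BlobStore:
    """Toy blob store that hands out auto-incrementing IDs (hypothetical)."""

    def __init__(self, next_id):
        self.next_id = next_id
        self.blobs = {}

    def save(self, text):
        blob_id = self.next_id
        self.blobs[blob_id] = text
        self.next_id += 1
        return blob_id


# The current master kept allocating IDs while the old master was down.
current_master = BlobStore(next_id=1000)
id_a = current_master.save("text of page A")

# The old master comes back with its counter where it left off, and a web
# server with stale configuration writes to it.
stale_master = BlobStore(next_id=1000)
id_b = stale_master.save("text of page B")

# Same ID, different content: a reader who asks the current master for
# page B's blob receives page A's text instead.
same_id = id_a == id_b
different_text = current_master.blobs[id_a] != stale_master.blobs[id_b]
print(same_id, different_text)
```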
The good news is that the data is probably recoverable, with this procedure:

* For each wiki, list blobs which differ between srv87 and srv88
* For each mismatched blob, locate the two (or more) revisions using it
* Disambiguate them somehow... (probably mostly automatable... I hope!)
* Copy the srv88 blobs onto srv87, and reassign the revisions to the appropriate new blob ID numbers
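The first two steps of the procedure above could be sketched roughly as follows in Python. This is only an illustration under assumptions: each server's blobs are modelled as a plain dict of blob ID to text, revisions as `(rev_id, blob_id)` pairs, and the function names are made up for this example. Blobs that exist on only one side are ignored here, since they need copying but not disambiguating.

```python
from collections import defaultdict


def find_mismatched_blobs(blobs_a, blobs_b):
    """Return sorted blob IDs present in both stores whose contents differ."""
    shared = blobs_a.keys() & blobs_b.keys()
    return sorted(i for i in shared if blobs_a[i] != blobs_b[i])


def revisions_by_blob(revisions, blob_ids):
    """Group revision IDs by the blob ID they point to, for the given blobs."""
    wanted = set(blob_ids)
    grouped = defaultdict(list)
    for rev_id, blob_id in revisions:
        if blob_id in wanted:
            grouped[blob_id].append(rev_id)
    return dict(grouped)


# Hypothetical sample data: blob 11 collides, blob 10 is identical on both.
good_store = {10: "intro", 11: "old article text", 12: "only on good"}
stray_store = {10: "intro", 11: "new article text"}
revisions = [(501, 10), (502, 11), (777, 11)]

mismatched = find_mismatched_blobs(good_store, stray_store)
candidates = revisions_by_blob(revisions, mismatched)
print(mismatched, candidates)
```

Each entry in `candidates` is a set of revisions that still has to be disambiguated before the stray blob can be copied over under a new ID.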
I'm not sure how many pages are affected, but it appears to be 25 edits on fr.wikipedia.org, so perhaps a few hundred overall.
This could most likely have been avoided by forcing web servers to sync their MediaWiki configurations before starting Apache on boot.
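A minimal sketch of that idea in Python, assuming the boot sequence can be routed through a small wrapper: the web server is started only if the configuration sync exits successfully. The function name and the commands in the usage note are hypothetical; the real boot scripts are not shown in this thread.

```python
import subprocess


def start_web_server_after_sync(sync_cmd, start_cmd, run=subprocess.run):
    """Run the sync command; start the web server only if the sync succeeded.

    sync_cmd and start_cmd are argument lists, e.g. ["apachectl", "start"].
    Returns True if the web server start command was issued, False otherwise.
    """
    sync = run(sync_cmd)
    if sync.returncode != 0:
        # Leave the web server down rather than serve with stale config.
        return False
    run(start_cmd)
    return True
```

It might be called from a boot script as `start_web_server_after_sync(["/path/to/sync-script"], ["apachectl", "start"])`, where the sync script path is a placeholder. Failing closed keeps a freshly booted machine out of service until its configuration is known to be current.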
-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
Brion Vibber wrote:
> The current write clusters are cluster8 and cluster9. The DB master for cluster8 was srv89, which had recently had some problems: it was apparently overloaded, and didn't respond to login attempts.
>
> On November 28, we had it rebooted, but it didn't come back online. Cluster8 was reshuffled, with the master moved to srv87.
[Please swap srv87 and srv88 in the previous message. srv88 is the new master, srv87 the slave. srv89 is still the broken one. ;)]
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
> So, some portion of pages saved during an hour or so had their items saved into srv89 instead of srv87. When other web servers then came to read the pages back, they read from srv87 or its clone srv88, and received a _different page_ with the _same blob ID number_.
Three other down web servers were brought up at the same time, with the same bad configuration. I've resynced them all now, so there should be no more things falling into the wrong server.
Will try to have the broken ones recovered tonight or tomorrow.
-- brion vibber (brion @ pobox.com)
On Fri, Dec 01, 2006 at 04:33:47PM -0800, Brion Vibber wrote:
> Brion Vibber wrote:
>> So, some portion of pages saved during an hour or so had their items saved into srv89 instead of srv87. When other web servers then came to read the pages back, they read from srv87 or its clone srv88, and received a _different page_ with the _same blob ID number_.
>
> Three other down web servers were brought up at the same time, with the same bad configuration. I've resynced them all now, so there should be no more things falling into the wrong server.
>
> Will try to have the broken ones recovered tonight or tomorrow.
At bootup, /home/wikipedia/sbin/post-boot-config-sync.sh is being called. Should we add scap15-1 to that script, so that the system updates the software when booting up?
Regards,
jens
Jens Frank wrote:
> At bootup, /home/wikipedia/sbin/post-boot-config-sync.sh is being called. Should we add scap15-1 to that script, so that the system updates the software when booting up?
Yes, I think that would be wise.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
> Brion Vibber wrote:
>> So, some portion of pages saved during an hour or so had their items saved into srv89 instead of srv87. When other web servers then came to read the pages back, they read from srv87 or its clone srv88, and received a _different page_ with the _same blob ID number_.
>
> Will try to have the broken ones recovered tonight or tomorrow.
I'm pretty sure I've got the broken blobs all dealt with now (with the possible exception of some that were in deleted pages, in which case who cares ;)
Most cases were automatically resolvable due to the separate timing; conflicting entries on the good server were made two or three days earlier, generally.
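As a rough illustration of resolving by timing, the Python sketch below assigns each revision that shares a colliding blob ID to either the stray copy or the good copy, based on whether it was saved inside the incident window. The function names, window bounds, and timestamps are hypothetical and chosen only for the example; this is not the script that was actually used.

```python
from datetime import datetime


def pick_copy(saved_at, window_start, window_end):
    """Revisions saved inside the incident window belong to the stray copy."""
    return "stray" if window_start <= saved_at < window_end else "good"


def assign_copies(revisions, window_start, window_end):
    """Map each revision ID to 'stray' or 'good' from its save time.

    revisions: iterable of (rev_id, saved_at) pairs sharing one blob ID.
    """
    return {
        rev_id: pick_copy(saved_at, window_start, window_end)
        for rev_id, saved_at in revisions
    }


# Hypothetical incident window and revisions sharing a single blob ID.
window_start = datetime(2006, 1, 10, 22, 30)
window_end = datetime(2006, 1, 10, 23, 30)
revisions = [
    (501, datetime(2006, 1, 8, 9, 0)),     # saved days earlier
    (777, datetime(2006, 1, 10, 22, 45)),  # saved during the window
]

assignment = assign_copies(revisions, window_start, window_end)
print(assignment)
```

A case where both revisions fall inside the window would not be separable this way and would need manual inspection.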
It's possible that some pages were affected by cached negative results from attempts to load items that were missing on the other server; those should fall out of cache sooner or later, or a purge will take care of them individually.
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org