A problem was reported to us today which I have tracked down to a configuration consistency problem.
First some background:
We use "external storage" servers to store the bulk of page text contents for Wikimedia's wikis. These are web servers where we make use of otherwise unused disk space by sticking on a copy of MySQL and storing compressed text blobs.
For reliability, they come in clusters, using MySQL's replication. So if one dies, we can shuffle them about with no loss of data.
New pages get saved into the most recent one or two ES clusters, and every once in a while we close an old one off and add a new one on the end.
The current write clusters are cluster8 and cluster9. The DB master for cluster8 was srv89, which had recently had some problems: it was apparently overloaded, and didn't respond to login attempts.
On November 28, we had it rebooted, but it didn't come back online. Cluster8 was reshuffled, with the master moved to srv87.
At about 22:30 UTC, srv89 came back online. Unfortunately this meant that its web server came online as well, and apparently made it back into the load balancing pool -- but with old copies of the MediaWiki configuration.
srv89's MediaWiki still had the configuration files telling it to save text blobs into the external storage server on... srv89, itself.
So, some portion of pages saved during an hour or so had their items saved into srv89 instead of srv87. When other web servers then came to read the pages back, they read from srv87 or its clone srv88, and received a _different page_ with the _same blob ID number_.
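As a rough illustration of how the collision arises, here is a toy Python sketch. It assumes each storage master hands out blob IDs from its own auto-increment counter; the BlobStore class and the sample texts are hypothetical stand-ins, not the real external storage code.

```python
class BlobStore:
    """Toy stand-in for one external storage master (hypothetical)."""

    def __init__(self, blobs=None):
        # Copy the starting contents so each store evolves independently.
        self.blobs = dict(blobs or {})

    def save(self, text):
        # Mimic an auto-increment primary key: next ID is highest ID + 1.
        blob_id = max(self.blobs, default=0) + 1
        self.blobs[blob_id] = text
        return blob_id


# Both servers share the same history up to the point of the reshuffle.
shared_history = {1: "old page A", 2: "old page B"}
srv87 = BlobStore(shared_history)  # current master
srv89 = BlobStore(shared_history)  # former master, back online with old config

# One edit goes through a stale web server, another through a current one.
id_stale = srv89.save("edit saved via the stale configuration")
id_good = srv87.save("edit saved via the current configuration")

# Same blob ID, different content: readers asking srv87 for the first
# edit's blob get the second edit's text instead.
collision = id_stale == id_good and srv87.blobs[id_good] != srv89.blobs[id_stale]
```

Under these assumptions both saves receive the same ID, which is why a revision can end up pointing at somebody else's text.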
The good news is that the data is probably recoverable, using this procedure:
* For each wiki, list blobs which differ between srv87 and srv89
* For each mismatched blob, locate the two (or more) revisions using it
* Disambiguate them somehow... (probably mostly automatable... I hope!)
* Copy the srv89 blobs onto srv87, and reassign the revisions to the appropriate new blob ID numbers
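The first two steps of the procedure above might be sketched roughly like this in Python. This is only an illustration: it assumes the blobs from each server have already been dumped into plain dicts keyed by blob ID, and that revisions are available as (revision ID, blob ID) pairs. The function names and sample data are hypothetical, not part of MediaWiki.

```python
def find_mismatched_blobs(master_blobs, stray_blobs):
    """Return sorted blob IDs present in both stores with different content."""
    common_ids = master_blobs.keys() & stray_blobs.keys()
    return sorted(
        blob_id
        for blob_id in common_ids
        if master_blobs[blob_id] != stray_blobs[blob_id]
    )


def revisions_by_blob(revisions, blob_ids):
    """Group revision IDs by the blob ID they point at, for the given blobs."""
    grouped = {blob_id: [] for blob_id in blob_ids}
    for rev_id, blob_id in revisions:
        if blob_id in grouped:
            grouped[blob_id].append(rev_id)
    return grouped


# Hypothetical sample data for one wiki.
master_blobs = {1: "a", 2: "b", 3: "text saved on the master"}
stray_blobs = {1: "a", 2: "b", 3: "text saved on the stray server"}
revisions = [(100, 2), (101, 3), (102, 3)]

mismatched = find_mismatched_blobs(master_blobs, stray_blobs)
candidates = revisions_by_blob(revisions, mismatched)
```

The sketch only looks at IDs present on both sides. Deciding which revision belongs to which copy of the text (the "disambiguate" step) is left out, since that depends on data not described here.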
I'm not sure how many pages are affected, but it appears to be 25 edits on fr.wikipedia.org, so perhaps a few hundred overall.
This could most likely have been avoided by forcing web servers to sync their MediaWiki configurations before starting Apache on boot.
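A minimal sketch of that boot-time guard, in Python. The two callables are placeholders for whatever actually syncs the configuration and starts Apache on a given server. They are assumptions for illustration, not existing scripts.

```python
def start_web_server(sync_config, start_server):
    """Start the web server only after a successful configuration sync.

    sync_config: callable returning True if the sync succeeded.
    start_server: callable that starts the web server.
    Returns True if the server was started, False otherwise.
    """
    if not sync_config():
        # Stale configuration: stay out of the load balancing pool.
        return False
    start_server()
    return True


# Usage with stub callables that just record what happened.
events = []


def failing_sync():
    events.append("sync failed")
    return False


def working_sync():
    events.append("sync ok")
    return True


def fake_start():
    events.append("server started")


started_after_failure = start_web_server(failing_sync, fake_start)
started_after_success = start_web_server(working_sync, fake_start)
```

The point of the design is ordering: the server never starts serving requests unless the sync step has reported success first.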
-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)