Answering a few questions in one place.
On Fri, Nov 18, 2011 at 10:29 AM, Brion Vibber <brion@wikimedia.org> wrote:
> Hmm... what I'd expect is that if one ES save target database is in read-only, the system should cycle through to the next available one that is working -- the save should then succeed transparently.
> Do we not have that sort of write failover logic, or are *all* ES clusters getting locked somehow?
The last step of the maintenance was to switch the master for article writes from ms3 to es3. To make sure no data was lost during the transition, I marked the master read-only for the duration of the switch. Since writes are sent to only one ES target database (currently es3), there is nowhere to fail over to. (All slaves run read-only all the time.)
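To make the failover question concrete, here is a minimal Python sketch of the "cycle to the next working target" behaviour Brion describes. It is not MediaWiki's actual code: `pick_write_target`, `NoWritableTarget`, and the dictionary layout are hypothetical names made up for illustration. With several write targets a read-only one would simply be skipped, but with a single write master there is nothing left to try:

```python
class NoWritableTarget(Exception):
    """Raised when every configured write target is read-only."""


def pick_write_target(targets):
    """Return the name of the first target that is not read-only.

    `targets` is a list of dicts with hypothetical keys "name" and
    "read_only"; this is an illustration, not the real ES configuration.
    """
    for target in targets:
        if not target["read_only"]:
            return target["name"]
    raise NoWritableTarget("all write targets are read-only")


# During the switch: one write master only, and it is marked read-only.
during_switch = [{"name": "master", "read_only": True}]

try:
    pick_write_target(during_switch)
    save_ok = True
except NoWritableTarget:
    save_ok = False  # nothing to fail over to, so the save fails
```

In this sketch `save_ok` ends up false for the duration of the switch, which matches the failed saves reported during the deploy.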
On Fri, Nov 18, 2011 at 2:48 PM, Platonides <Platonides@gmail.com> wrote:
> On 18/11/11 18:51, Ben Hartshorne wrote:
>> Hi everyone,
>> I just posted a note (http://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-broken-new-database-servers-deployed/) on the blog about our new external store but wanted to add a few details here. The deploy went smoothly, and I'm very happy with how the project progressed overall.
> I thought the ES apaches had disappeared some time ago. Maybe those were just the memcached apaches.
'fraid not. The ES apaches were removed from service as part of this deploy; it happened on Nov. 9th at 19:36 (see http://wikitech.wikimedia.org/view/Server_admin_log).
>> The project originally included recompressing all of the object types in the external store databases, continuing the work that was started in 2010.
> Are you aware of https://bugzilla.wikimedia.org/20757#c9 ? Are you very sure compressOld won't break anything? Anything that touches the text table has a big potential for data loss.
Yup! I left off what should have been the second half of that sentence. "The project originally... but that goal was dropped after discovering the dragons that lurked in that section of code." After reading that bug and doing the investigation that generated http://wikitech.wikimedia.org/view/Text_storage_data we decided that it would be better to separate recompressing the old text from moving it to new and more reliable hardware. Now that the move is complete, we can start carefully probing into the recompression part again (though this work has not yet been scheduled).
>> I spent some time doing verification that things were behaving as expected and it turns out they weren't. Upon examining the count of different data types in the external store content, I found that some types that are no longer supposed to be used were still getting created. I've filed https://bugzilla.wikimedia.org/show_bug.cgi?id=32478 to track the investigation and resolution of those differences.
> I found out the source of those 'gzip,external' entries: AbuseFilter. Noted at the bug.
You and Brion both rock. Thank you for finding this! I look forward to seeing that fix get in and then re-running the stats generation a few more times to validate, over a period of time, that none of those counters are moving in unexpected ways.
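As a rough illustration of that validation step, the following Python sketch compares two snapshots of per-type counts and reports any retired type whose counter moved. It is a sketch under stated assumptions, not the existing stats scripts: `moved_counters` is a hypothetical helper, the counts are invented, and the list of retired types is assumed for the example.

```python
def moved_counters(before, after, retired_types):
    """Return {type: (old, new)} for retired types whose count changed."""
    moved = {}
    for flag_type in retired_types:
        old = before.get(flag_type, 0)
        new = after.get(flag_type, 0)
        if new != old:
            moved[flag_type] = (old, new)
    return moved


# Made-up snapshots of type counts, taken some time apart.
snapshot_1 = {"utf-8,gzip,external": 1000, "gzip,external": 40}
snapshot_2 = {"utf-8,gzip,external": 1250, "gzip,external": 43}

# Types that should no longer be created (assumed list, for illustration).
retired = ["gzip,external"]

unexpected = moved_counters(snapshot_1, snapshot_2, retired)
```

With these invented numbers, `unexpected` flags 'gzip,external' because its count grew between snapshots; an empty result across several runs would suggest the fix is holding.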
>> During the deploy there was a brief (about 10-minute) period during which article saves failed due to the external store databases being in read-only mode. As expected, some folks showed up in IRC telling us of the 'problem'. After the migration was complete we brainstormed a bit in IRC about good ways of informing editors of planned maintenance such as this migration. The regular databases (s3, etc.) have a read-only mode flag so that the affected wikis show a reasonable error, but the external store databases are a little different. Because of the way they're spread out, the outage of a specific database cluster does not affect specific language projects, but instead affects a specific time range for all wikis. Additionally, the currently writable external store database affects article edits on all wikis.
> You could have made everything read-only, too. It's a wider scope than strictly needed, but I don't think it's that important to keep e.g. the watchlist changeable if edits don't work.
I'll label that suggestion #5. :)
Thanks,
-ben