Answering a few questions in one place.
On Fri, Nov 18, 2011 at 10:29 AM, Brion Vibber <brion(a)wikimedia.org> wrote:
Hmm... what I'd expect is that if one ES save target database is in
read-only, the system should cycle through to the next available one that
is working -- the save should then succeed transparently.
Do we not have that sort of write failover logic, or are *all* ES clusters
getting locked somehow?
The last step of the maintenance was to switch the master for article
writes from ms3 to es3. In order to make sure no data is lost during
the transition, I marked the master read-only for the duration of the
switch. Given that there is only one ES target database to which
writes are sent (currently es3), there is nowhere to fail over to.
(All slaves run read-only all the time.)
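
To make the failover question concrete, here is a rough sketch of the behaviour Brion describes: try each configured write target in turn and skip any that are read-only. This is not MediaWiki's actual ExternalStore code; `Target`, `ReadOnlyError`, and `save_with_failover` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


class ReadOnlyError(Exception):
    """Raised when a write is attempted against a read-only target."""


@dataclass
class Target:
    # Hypothetical stand-in for one external store write target.
    name: str
    read_only: bool = False
    blobs: list = field(default_factory=list)

    def store(self, blob: str) -> str:
        if self.read_only:
            raise ReadOnlyError(self.name)
        self.blobs.append(blob)
        # Return an address identifying where the blob landed.
        return f"{self.name}/{len(self.blobs) - 1}"


def save_with_failover(targets, blob):
    """Try each target in order; return the address of the first success."""
    for target in targets:
        try:
            return target.store(blob)
        except ReadOnlyError:
            continue  # this target is locked, try the next one
    raise ReadOnlyError("no writable external store target")
```

With two targets, a read-only first target is skipped and the save lands on the second. With a single target, as in the setup described above, the loop has nothing to fall back on and the save fails.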
On Fri, Nov 18, 2011 at 2:48 PM, Platonides <Platonides(a)gmail.com> wrote:
On 18/11/11 18:51, Ben Hartshorne wrote:
Hi everyone,
I just posted a note
<http://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-br…> on
the blog about our new external store but wanted to add a few details
here. The deploy went smoothly, and I'm very happy with how the project
progressed overall.
I thought the ES apaches had disappeared some time ago. Maybe those were just
the memcached apaches.
'fraid not. The ES apaches were removed from service as part of this
deploy; it happened on Nov. 9th at 19:36 <ref>
http://wikitech.wikimedia.org/view/Server_admin_log</ref>
The project originally included recompressing all of the object types in
the external store databases, continuing the work that was started in
2010.
Are you aware of
https://bugzilla.wikimedia.org/20757#c9 ?
Are you very sure compressOld won't break anything? Anything that
touches the text table has a big potential for data loss.
Yup! I left off what should have been the second half of that
sentence. "The project originally... but that goal was dropped after
discovering the dragons that lurked in that section of code." After
reading that bug and doing the investigation that generated
http://wikitech.wikimedia.org/view/Text_storage_data we decided that
it would be better to separate recompressing the old text from moving
it to new and more reliable hardware. Now that the move is complete,
we can start carefully probing into the recompression part again
(though this work has not yet been scheduled).
I spent some time doing verification that things were behaving as
expected and it turns out they weren't. Upon examining the count of
different data types in the external store content, I found that some types
that are no longer supposed to be used were still getting created. I've
filed https://bugzilla.wikimedia.org/show_bug.cgi?id=32478 to track the
investigation and resolution of those differences.
I found out the source of those 'gzip,external' entries: AbuseFilter.
Noted at the bug.
You and Brion both rock. Thank you for finding this! I look forward
to seeing that fix get in and then re-running the stats generation a
few more times to validate, over a period of time, that none of those
counters are moving in unexpected ways.
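
The validation I have in mind amounts to tallying entries by their type string at two points in time and checking that the deprecated types have stopped growing. A minimal sketch, assuming the type strings have already been exported as a plain list; the function names are made up for illustration and this is not the script used to generate the stats.

```python
from collections import Counter


def count_types(type_strings):
    """Tally how many entries carry each type string, e.g. 'gzip,external'."""
    return Counter(type_strings)


def growing_types(earlier, later, watched):
    """Return the watched type strings whose count rose between two tallies."""
    # Counter returns 0 for missing keys, so unseen types compare cleanly.
    return sorted(t for t in watched if later[t] > earlier[t])
```

An empty result from `growing_types` across several runs would suggest that none of the watched counters are moving.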
During the deploy there was a brief (about 10 minute) period during which
article saves failed due to the external store databases being in read-only
mode. As expected, some folks showed up in IRC telling us of the
'problem'. After the migration was complete we brainstormed a bit in IRC
about good ways of informing editors of planned maintenance such as this
migration. The regular databases (s3, etc.) have a read-only mode flag so
that the affected wikis show a reasonable error, but the external store
databases are a little different. Because of the way they're spread out,
the outage of a specific database cluster does not affect specific language
projects, but instead affects a specific time range for all wikis.
Additionally, the currently writable external store database affects
article edits on all wikis.
You could have made everything read-only, too. It's a wider scope than
strictly needed, but I don't think it's that important to keep e.g.
watchlist changeable if edits don't work.
I'll label that suggestion #5. :)
Thanks,
-ben