Re: [Wikitech-l] new external store databases deployed - followup questions

18 Nov 2011

Answering a few questions in one place.

On Fri, Nov 18, 2011 at 10:29 AM, Brion Vibber <brion(a)wikimedia.org> wrote:
...

 Hmm... what I'd expect is that if one ES save target database is in
 read-only, the system should cycle through to the next available one that
 is working -- the save should then succeed transparently.

 Do we not have that sort of write failover logic, or are *all* ES clusters
 getting locked somehow? 
The last step of the maintenance was to switch the master for article
writes from ms3 to es3.  In order to make sure no data is lost during
the transition, I marked the master read-only for the duration of the
switch.  Given that there is only one ES target database to which
writes are sent (currently es3), there is nowhere to fail over
to.  (All slaves run read-only all the time.)
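
To make the point concrete, here is a minimal sketch of the kind of
write failover Brion describes.  This is not MediaWiki's actual
external store code; the Store class, ReadOnlyError, and
save_with_failover are hypothetical names used only for illustration.

```python
class ReadOnlyError(Exception):
    """Raised when a store refuses a write (hypothetical error type)."""


class Store:
    """Toy stand-in for one external store write target."""

    def __init__(self, name, read_only=False):
        self.name = name
        self.read_only = read_only
        self.blobs = []

    def save(self, blob):
        if self.read_only:
            raise ReadOnlyError(self.name)
        self.blobs.append(blob)
        # Return a location string: target name plus index of the blob.
        return f"{self.name}/{len(self.blobs) - 1}"


def save_with_failover(targets, blob):
    """Try each write target in order; return the first successful location."""
    for store in targets:
        try:
            return store.save(blob)
        except ReadOnlyError:
            continue  # this target is locked, try the next one
    # Every target refused the write, so the save has to fail.
    raise ReadOnlyError("no writable external store target")
```

With a single configured target, as in our setup, the loop has exactly
one candidate, so marking it read-only makes every save fail no matter
how the failover logic is written.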

On Fri, Nov 18, 2011 at 2:48 PM, Platonides <Platonides(a)gmail.com> wrote:
...
  On 18/11/11 18:51, Ben Hartshorne wrote:
  Hi everyone,

 I just posted a note <http://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-br…
 the blog about our new external store but wanted to add a few details
 here.  The deploy went smoothly, and I'm very happy with how the project
 progressed overall. 
 I thought the ES apaches had disappeared some time ago. Maybe those were just
 the memcached apaches. 
'fraid not.  The ES apaches were removed from service as part of this
deploy; it happened on Nov. 9th at 19:36
<ref>http://wikitech.wikimedia.org/view/Server_admin_log</ref>

...
 The project originally included recompressing all of the object types in
 the external store databases, continuing the work that was started in
 2010.
 Are you aware of https://bugzilla.wikimedia.org/20757#c9 ?
 Are you very sure compressOld won't break anything? Anything that
 touches text table has a big potential for data loss. 
Yup!  I left off what should have been the second half of that
sentence.  "The project originally... but that goal was dropped after
discovering the dragons that lurked in that section of code."  After
reading that bug and doing the investigation that generated
http://wikitech.wikimedia.org/view/Text_storage_data we decided that
it would be better to separate recompressing the old text from moving
it to new and more reliable hardware.  Now that the move is complete,
we can start carefully probing into the recompression part again
(though this work has not yet been scheduled).

...

 I spent some time doing verification that things were behaving as
 expected and it turns out they weren't.  Upon examining the count of
 different data types in the external store content, I found that some types
 that are no longer supposed to be used were still getting created.  I've
 filed https://bugzilla.wikimedia.org/show_bug.cgi?id=32478 to track the
 investigation and resolution of those differences. 
 I found out the source of those 'gzip,external' entries: AbuseFilter.
 Noted at the bug. 
You and Brion both rock.  Thank you for finding this!  I look forward
to seeing that fix get in and then re-running the stats generation a
few more times to validate, over a period of time, that none of those
counters are moving in unexpected ways.
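
As a rough sketch of that validation step (not the actual stats
script): given two snapshots of per-type counts taken some time apart,
flag any type that should no longer be written but whose count went
up.  The function name and the sample counts are hypothetical; only
the 'gzip,external' type comes from the bug.

```python
# Types that should no longer be created (from the bug discussion).
DEPRECATED = {"gzip,external"}


def unexpected_growth(before, after, deprecated=DEPRECATED):
    """Return {type: increase} for deprecated types whose count went up."""
    growth = {}
    for flag in deprecated:
        delta = after.get(flag, 0) - before.get(flag, 0)
        if delta > 0:
            growth[flag] = delta
    return growth


# Made-up numbers, for illustration only.
snapshot_1 = {"gzip,external": 1000, "utf-8,gzip": 5000}
snapshot_2 = {"gzip,external": 1004, "utf-8,gzip": 5200}
moved = unexpected_growth(snapshot_1, snapshot_2)
```

An empty result across several runs would suggest the fix is holding.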

...
 During the deploy there was a brief (about 10 minute) period during which
 article saves failed due to the external store databases being in read-only
 mode.  As expected, some folks showed up in IRC telling us of the
 'problem'.  After the migration was complete we brainstormed a bit in IRC
 about good ways of informing editors of planned maintenance such as this
 migration.  The regular databases (s3, etc.) have a read-only mode flag so
 that the affected wikis show a reasonable error, but the external store
 databases are a little different.  Because of the way they're spread out,
 the outage of a specific database cluster does not affect specific language
 projects, but instead affects a specific time range for all wikis.
 Additionally, the currently writable external store database affects
 article edits on all wikis. 
 You could have made everything read-only, too. It's a wider scope than
 strictly needed, but I don't think it's that important to keep e.g.
 watchlist changeable if edits don't work. 
I'll label that suggestion #5.  :)

Thanks,

-ben
