Hi everyone,
I just posted a note <http://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-broken-new-database-servers-deployed/> on the blog about our new external store but wanted to add a few details here. The deploy went smoothly, and I'm very happy with how the project progressed overall. There are plenty more details on the project itself on the project wiki page <http://wikitech.wikimedia.org/view/External_storage/Update_2011-08> and hiding in RT. There were a few follow-up items to come out of it, and I want to talk through those in hopes that someone either picks them up or has suggestions on what to do.
The project originally included recompressing all of the object types in the external store databases, continuing the work that was started in 2010. I spent some time doing verification that things were behaving as expected and it turns out they weren't. Upon examining the count of different data types in the external store content, I found that some types that are no longer supposed to be used were still getting created. I've filed https://bugzilla.wikimedia.org/show_bug.cgi?id=32478 to track the investigation and resolution of those differences.
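The kind of check described above can be sketched roughly as follows. This is only an illustration, not the script that was actually run: the function names are hypothetical, the flag strings are passed in as plain Python values rather than read from a database, and the set of "retired" types is an assumption supplied by the caller ('gzip,external' is the value named later in this thread; the other values are made up for the example).

```python
from collections import Counter
from typing import Dict, Iterable, Set


def count_flag_types(flag_values: Iterable[str]) -> Counter:
    """Count how many rows carry each flag string (hypothetical helper)."""
    return Counter(value.strip() for value in flag_values)


def growing_retired_types(before: Counter, after: Counter,
                          retired: Set[str]) -> Dict[str, int]:
    """Return the retired types whose row count grew between two snapshots."""
    # Counter returns 0 for a missing key, so new types are handled too.
    return {t: after[t] - before[t] for t in retired if after[t] > before[t]}


# Two made-up snapshots taken some time apart.
before = count_flag_types(["external", "utf-8,gzip", "gzip,external"])
after = count_flag_types(["external", "external", "utf-8,gzip",
                          "gzip,external", "gzip,external"])

print(growing_retired_types(before, after, {"gzip,external"}))
# {'gzip,external': 1}
```

Comparing snapshots rather than looking at a single count is what shows that a type is still being created, as opposed to merely left over from old data.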
During the deploy there was a brief (about 10 minute) period during which article saves failed due to the external store databases being in read-only mode. As expected, some folks showed up in IRC telling us of the 'problem'. After the migration was complete we brainstormed a bit in IRC about good ways of informing editors of planned maintenance such as this migration. The regular databases (s3, etc.) have a read-only mode flag so that the affected wikis show a reasonable error, but the external store databases are a little different. Because of the way they're spread out, the outage of a specific database cluster does not affect specific language projects, but instead affects a specific time range for all wikis. Additionally, the currently writable external store database affects article edits on all wikis.
There were a few suggestions thrown around:
1) Use CentralNotice. This would certainly have the effect of alerting all wikis that there was some maintenance, but it has the disadvantage of telling all *readers* about the outage, rather than only the people who would actually be interested (those editing pages).
2) Make MediaWiki cache the change to conceal the outage from editors. The idea here is that MediaWiki would notice that the backend database is currently in read-only mode, cache the change, and write it to the DB when it returns to read-write mode. There are a number of technical challenges here, as well as the introduction of another system (the change cache), but it's an interesting way around the problem: rather than addressing how to inform editors of impending maintenance, it simply eliminates the need for that communication.
3) Throw up a banner on the edit page itself. The time when we want to inform someone that there is going to be maintenance that will impede editing is when the user begins an edit. (At the moment we inform them when they try to save the edit, in the form of an error message.) If there was a banner on all edit pages that informed the user not to save their document during a specific time period, they could choose to postpone the edit or finish quickly. The text would be something like "There will be planned maintenance starting in 23 minutes and lasting for 30 minutes. You will be unable to save edits during the maintenance period. Please save your work before maintenance begins." During the maintenance, we could change the message to be more visible, or we could take more drastic action such as disabling the edit or save buttons.
4) Don't make any change from what we do now. The external store databases rarely fail or undergo maintenance. Increasing the complexity of the system to protect against their outage is more likely to cause harm than the outages themselves. Instead, just announce it on the blog beforehand and apologize to anybody affected afterwards.
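As a rough illustration of option 3, the banner text could be derived from the maintenance window and the current time. This is a minimal sketch, not MediaWiki code: the function name, the `warn_ahead` parameter, and the wording of the in-progress message are assumptions made for the example; only the pre-maintenance wording comes from the suggestion above.

```python
from datetime import datetime, timedelta
from typing import Optional


def maintenance_banner(now: datetime, start: datetime, duration: timedelta,
                       warn_ahead: timedelta = timedelta(hours=1)) -> Optional[str]:
    """Return the edit-page banner text for `now`, or None if no banner is due."""
    end = start + duration
    if now >= end or now < start - warn_ahead:
        return None  # maintenance is over, or too far away to mention
    if now >= start:
        # Maintenance in progress: switch to a more pointed message.
        remaining = int((end - now).total_seconds() // 60)
        return ("Maintenance is in progress. You will be unable to save edits "
                f"for about {remaining} more minutes.")
    until_start = int((start - now).total_seconds() // 60)
    length = int(duration.total_seconds() // 60)
    return (f"There will be planned maintenance starting in {until_start} minutes "
            f"and lasting for {length} minutes. You will be unable to save edits "
            "during the maintenance period. Please save your work before "
            "maintenance begins.")


start = datetime(2011, 11, 18, 12, 0)
print(maintenance_banner(datetime(2011, 11, 18, 11, 37), start, timedelta(minutes=30)))
```

Returning None outside the warning window keeps the edit page uncluttered except when the notice is actually relevant.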
I'm sure there are some more ideas on what we should do, as well as opinions about these various options out there. Discuss! :) I haven't filed a bug yet, but will do so if this conversation comes to some consensus about a specific thing that should be done.
Thanks,
-ben
You could conceivably embed some JavaScript in a CentralNotice banner such that it would only display when a user was editing a page (addressing #1 and #3 of the options you came up with).
Hmm... what I'd expect is that if one ES save target database is in read-only, the system should cycle through to the next available one that is working -- the save should then succeed transparently.
Do we not have that sort of write failover logic, or are *all* ES clusters getting locked somehow?
-- brion
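The write failover described in this message could look roughly like the sketch below. It is an illustration of the expected behaviour only, not how MediaWiki's external store code works: the `Cluster` class, the `name/id` address format, and the cluster names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class AllTargetsReadOnly(Exception):
    """Raised when no write target accepts the save."""


@dataclass
class Cluster:
    name: str
    read_only: bool = False
    blobs: Dict[int, str] = field(default_factory=dict)

    def store(self, text: str) -> str:
        if self.read_only:
            raise PermissionError(f"{self.name} is read-only")
        blob_id = len(self.blobs) + 1
        self.blobs[blob_id] = text
        return f"{self.name}/{blob_id}"


def save_with_failover(targets: List[Cluster], text: str) -> str:
    """Try each write target in order; return the address of the first success."""
    for cluster in targets:
        try:
            return cluster.store(text)
        except PermissionError:
            continue  # this target is locked; try the next one
    raise AllTargetsReadOnly("no writable external store target")


# One locked target and one healthy one: the save succeeds transparently.
print(save_with_failover([Cluster("a", read_only=True), Cluster("b")], "text"))
# b/1
```

With only a single write target configured, the loop has nothing to fall back on and the save still fails, so this approach only helps if more than one writable cluster exists.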
Answering a few questions in one place.
On Fri, Nov 18, 2011 at 10:29 AM, Brion Vibber brion@wikimedia.org wrote:
> Hmm... what I'd expect is that if one ES save target database is in read-only, the system should cycle through to the next available one that is working -- the save should then succeed transparently.
> Do we not have that sort of write failover logic, or are *all* ES clusters getting locked somehow?
The last step of the maintenance was to switch the master for article writes from ms3 to es3. In order to make sure no data is lost during the transition, I marked the master read-only for the duration of the switch. Given that there is only one ES target database to which writes are sent (currently es3), there is nowhere to fail over to. (All slaves run read-only all the time.)
On Fri, Nov 18, 2011 at 2:48 PM, Platonides Platonides@gmail.com wrote:
> On 18/11/11 18:51, Ben Hartshorne wrote:
>> Hi everyone,
>> I just posted a note <http://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-broken-new-database-servers-deployed/> on the blog about our new external store but wanted to add a few details here. The deploy went smoothly, and I'm very happy with how the project progressed overall.
> I thought the ES apaches had disappeared some time ago. Maybe those were just the memcached apaches.
'fraid not. The ES apaches were removed from service as part of this deploy; it happened on Nov. 9th at 19:36 (see http://wikitech.wikimedia.org/view/Server_admin_log).
>> The project originally included recompressing all of the object types in the external store databases, continuing the work that was started in 2010.
> Are you aware of https://bugzilla.wikimedia.org/20757#c9 ? Are you very sure compressOld won't break anything? Anything that touches the text table has a big potential for data loss.
Yup! I left off what should have been the second half of that sentence. "The project originally... but that goal was dropped after discovering the dragons that lurked in that section of code." After reading that bug and doing the investigation that generated http://wikitech.wikimedia.org/view/Text_storage_data we decided that it would be better to separate recompressing the old text from moving it to new and more reliable hardware. Now that the move is complete, we can start carefully probing into the recompression part again (though this work has not yet been scheduled).
>> I spent some time doing verification that things were behaving as expected and it turns out they weren't. Upon examining the count of different data types in the external store content, I found that some types that are no longer supposed to be used were still getting created. I've filed https://bugzilla.wikimedia.org/show_bug.cgi?id=32478 to track the investigation and resolution of those differences.
> I found out the source of those 'gzip,external' entries: AbuseFilter. Noted at the bug.
You and Brion both rock. Thank you for finding this! I look forward to seeing that fix get in and then re-running the stats generation a few more times to validate, over a period of time, that none of those counters are moving in unexpected ways.
>> During the deploy there was a brief (about 10 minute) period during which article saves failed due to the external store databases being in read-only mode. As expected, some folks showed up in IRC telling us of the 'problem'. After the migration was complete we brainstormed a bit in IRC about good ways of informing editors of planned maintenance such as this migration. The regular databases (s3, etc.) have a read-only mode flag so that the affected wikis show a reasonable error, but the external store databases are a little different. Because of the way they're spread out, the outage of a specific database cluster does not affect specific language projects, but instead affects a specific time range for all wikis. Additionally, the currently writable external store database affects article edits on all wikis.
> You could have made everything read-only, too. It's a wider scope than strictly needed, but I don't think it's that important to keep e.g. the watchlist changeable if edits don't work.
I'll label that suggestion #5. :)
Thanks,
-ben
On Fri, Nov 18, 2011 at 3:41 PM, Ben Hartshorne bhartshorne@wikimedia.org wrote:
> Answering a few questions in one place.
> On Fri, Nov 18, 2011 at 10:29 AM, Brion Vibber brion@wikimedia.org wrote:
>> Hmm... what I'd expect is that if one ES save target database is in read-only, the system should cycle through to the next available one that is working -- the save should then succeed transparently.
>> Do we not have that sort of write failover logic, or are *all* ES clusters getting locked somehow?
> The last step of the maintenance was to switch the master for article writes from ms3 to es3. In order to make sure no data is lost during the transition, I marked the master read-only for the duration of the switch. Given that there is only one ES target database to which writes are sent (currently es3), there is nowhere to fail over to. (All slaves run read-only all the time.)
*nod* Logical enough. For the future I'd recommend planning a temporary 'holding zone' cluster that would be used only during the changeover -- it would remain read-write while the main ones are being copied.
Then after switching writes to the new targets, the holding zone can go read-only while it gets copied over to the new target, which should go relatively fast.
This would be just another part of the ES system rather than a separate cache, so it should remain reasonably robust: if something goes awry with the main copy to the new clusters, you can safely stop. The holding zone will just sit with the old servers and can keep running like the other ES clusters, unlike some sort of cache which might lose data.
-- brion
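The holding-zone changeover proposed in this message can be walked through with a toy model. This is a sketch under heavy assumptions: clusters are plain in-memory objects, the names `old`, `new`, and `holding` and the keys `r1` to `r3` are hypothetical, and `copy_all` stands in for whatever bulk copy mechanism would really be used.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Cluster:
    name: str
    read_only: bool = False
    blobs: Dict[str, str] = field(default_factory=dict)

    def store(self, key: str, text: str) -> None:
        if self.read_only:
            raise PermissionError(f"{self.name} is read-only")
        self.blobs[key] = text


def copy_all(src: Cluster, dst: Cluster) -> None:
    """Bulk copy that bypasses the read-only flag, as an admin-level copy would."""
    dst.blobs.update(src.blobs)


old = Cluster("old", blobs={"r1": "first revision"})
new = Cluster("new", read_only=True)
holding = Cluster("holding")

# Step 1: freeze the old master and send live writes to the holding zone.
old.read_only = True
write_target = holding
write_target.store("r2", "saved during the main copy")

# Step 2: the long copy runs while editors keep saving to the holding zone.
copy_all(old, new)

# Step 3: switch writes to the new master, then freeze the holding zone.
new.read_only = False
write_target = new
holding.read_only = True
write_target.store("r3", "saved after the switch")

# Step 4: the short holding-zone copy; no save was refused at any point.
copy_all(holding, new)
print(sorted(new.blobs))
# ['r1', 'r2', 'r3']
```

If the main copy in step 2 has to be abandoned, the holding zone simply stays in service as another cluster, which is the robustness argument made above.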
On 18/11/11 18:51, Ben Hartshorne wrote:
> Hi everyone,
> I just posted a note <http://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-broken-new-database-servers-deployed/> on the blog about our new external store but wanted to add a few details here. The deploy went smoothly, and I'm very happy with how the project progressed overall.
I thought the ES apaches had disappeared some time ago. Maybe those were just the memcached apaches.
> The project originally included recompressing all of the object types in the external store databases, continuing the work that was started in 2010.
Are you aware of https://bugzilla.wikimedia.org/20757#c9 ? Are you very sure compressOld won't break anything? Anything that touches the text table has a big potential for data loss.
> I spent some time doing verification that things were behaving as expected and it turns out they weren't. Upon examining the count of different data types in the external store content, I found that some types that are no longer supposed to be used were still getting created. I've filed https://bugzilla.wikimedia.org/show_bug.cgi?id=32478 to track the investigation and resolution of those differences.
I found out the source of those 'gzip,external' entries: AbuseFilter. Noted at the bug.
> During the deploy there was a brief (about 10 minute) period during which article saves failed due to the external store databases being in read-only mode. As expected, some folks showed up in IRC telling us of the 'problem'. After the migration was complete we brainstormed a bit in IRC about good ways of informing editors of planned maintenance such as this migration. The regular databases (s3, etc.) have a read-only mode flag so that the affected wikis show a reasonable error, but the external store databases are a little different. Because of the way they're spread out, the outage of a specific database cluster does not affect specific language projects, but instead affects a specific time range for all wikis. Additionally, the currently writable external store database affects article edits on all wikis.
You could have made everything read-only, too. It's a wider scope than strictly needed, but I don't think it's that important to keep e.g. the watchlist changeable if edits don't work.
> There were a few suggestions thrown around:
> - make MediaWiki cache the change to conceal the outage from editors. The idea here is that MediaWiki would notice that the backend database is currently in read-only mode and would cache the change and write it to the DB when it returns to read-write mode. There are a number of technical challenges here, as well as the introduction of another system (the change cache), but it's an interesting way around the problem, since rather than addressing how to inform editors of impending maintenance it simply eliminates the necessity for that communication.
I don't like it. Some changes get cached unnoticed somewhere (e.g. memcached), then suddenly they fail to transfer to the end system, and a few days later a bunch of content magically disappears.
It might be acceptable to store directly in the text table if all ES clusters are down, although that's probably not in our interest.
> - throw up a banner on the edit page itself. (...)
> During the maintenance, we could change the message to be more visible, or we could take more drastic action such as disabling the edit or save buttons.
I'm not opposed to advisory edit banners, but don't hide the Save buttons if saving may well be working (and even less so the edit ones).
On 11/18/2011 06:51 PM, Ben Hartshorne wrote:
> - throw up a banner on the edit page itself. The time when we want to inform someone that there is going to be maintenance that will impede editing is when the user begins an edit. (at the moment we inform them when they try to save the edit in the form of an error message.) If there was a banner on all edit pages that informed the user not to save their document during a specific time period, they could choose to postpone the edit or finish quickly.
Yes, maybe this is a good solution. But there is far too much information on the edit page already, almost like some shrink-wrap or click-through license text that nobody reads.
For a beginner or anonymous editor, the edit page could just have the warning instead of the edit box if any maintenance is planned for the next 30 minutes: "Please come back later". Make it super simple and friendly, with a short explanation that you only do this 7 hours per year, and that donations pay for keeping these servers running.
For experienced users, all the clutter with copyright warnings could be folded away, because they already know. And your information about planned maintenance could be shown as a warning: "Please edit, but be careful".
wikitech-l@lists.wikimedia.org