Re: [Wikitech-l] Data center switch-over moving ahead next week: please stay available :)

Mark Bergsma

21 Apr 2016

3:53 p.m.

Hi everyone,

After successfully serving our sites from our backup data center codfw (Dallas) for the past two days, we're now starting our switch back to eqiad (Ashburn) as planned[1].

We've already moved cache traffic back to eqiad, and within the next few minutes we'll disable editing by going read-only for approximately 30 minutes - hopefully a bit faster than 2 days ago.

[1] http://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/

On Tue, Apr 19, 2016 at 6:00 PM, Mark Bergsma mark@wikimedia.org wrote:

...

Hi all,

Today the data center switch-over commenced as planned, and has just completed successfully. We are now serving our sites from codfw (Dallas, Texas), and will do so for the next 2 days if all goes well.

We switched the wikis to read-only (editing disabled) at 14:02 UTC, and went back to read-write at 14:48 UTC - a little longer than planned. While edits were possible from then on, Special:RecentChanges (and related change feeds) unfortunately did not work until 15:10 UTC, when we found and fixed an unexpected configuration problem with our Redis servers. The site stayed up and available for readers throughout the entire migration.

Overall the procedure was a success, with few problems along the way. However, we've also carefully kept track of any issues and delays we encountered, so that we can evaluate them to improve and speed up the procedure and reduce the impact on our users - some of these improvements will already be implemented for our switch back on Thursday.

We're still expecting to find (possibly subtle) issues today, and would like everyone who notices anything to use the following channels to report them:

File a Phabricator issue with project #codfw-rollout

Report issues on IRC: Freenode channel #wikimedia-tech (if urgent)

Send an e-mail to the Operations list: ops@lists.wikimedia.org

We're not done yet, but thanks to all who have helped so far. :-)

Mark

-- Mark Bergsma mark@wikimedia.org Lead Operations Architect Director of Technical Operations Wikimedia Foundation

Mark Bergsma

21 Apr 2016

5:37 p.m.

New subject: Data center switch-over moving ahead next week: please stay available :)

We've just completed the switch back, and all services are running from our main data center eqiad (Ashburn) again.

The process went very smoothly this time around. In the two days leading up to this, we were able to either fix or work around the most important issues we encountered on Tuesday. This meant that we had no real setbacks or unanticipated delays today, and were therefore able to complete the most time-pressing and user-impacting part (during which MediaWiki is read-only) in 20 minutes, down from ~45 minutes two days ago.

However, we'll be doing this again in the future, and until then we'll work on improving and further automating this process, to hopefully bring its impact and duration down much further.

Please let us know if you see any issues which may be caused by the switch-over(s).

Thanks much to everyone involved!

Mark

Toby Negrin

5:44 p.m.

New subject: [Ops] Data center switch-over moving ahead next week: please stay available :)

Congrats Mark and everyone else involved. This is a big step for reliability and performance of the sites and a difficult technical task to say the least.

Well done!

-Toby

Ops mailing list Ops@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/ops

Arthur Richards

5:55 p.m.

New subject: [Ops] Data center switch-over moving ahead next week: please stay available :)

This is so rad - congratulations indeed to everyone who's been working on this!

-- Arthur Richards Team Practices Manager [[User:Awjrichards]] IRC: awjr +1-415-839-6885 x6687

Wes Moran

8:34 p.m.

New subject: [Engineering] [Ops] Data center switch-over moving ahead next week: please stay available :)

Well planned, well done!

Mark thanks for the summaries and hours dedicated to making this all work well. Thanks to the many teams working together to complete this effort.

Engineering mailing list Engineering@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/engineering
