Following a beta testing period [0] and a general use self-migration period [1], the Toolforge administration team is ready to begin the final phase of automatic migration of tools currently running on the legacy Kubernetes cluster to the 2020 Kubernetes cluster.
The migration process will involve Toolforge administrators running `webservice migrate` for each tool, in the same way that self-migration happens [2]. A small number of tools use the legacy Kubernetes cluster outside of the `webservice` system; these tools will be moved through a more manual process after all webservices have been migrated. We currently plan to do these migrations in several batches so that we can monitor the load and capacity of the 2020 Kubernetes cluster as we move ~640 more tools over from the legacy cluster.
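For tool maintainers who would rather self-migrate before their batch is processed, the process is roughly the following (see [2] for the full instructions; "mytool" here is just a placeholder tool name):

    $ become mytool
    $ webservice migrate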
Once the tools have all been moved to the 2020 cluster, we will continue with additional cleanup and default configuration changes that will allow us to fully decommission the legacy cluster. We will also be updating various documentation on Wikitech during this final phase. We hope to complete the entire process by 2020-03-06 at the latest.
[0]: https://lists.wikimedia.org/pipermail/cloud-announce/2020-January/000247.htm...
[1]: https://lists.wikimedia.org/pipermail/cloud-announce/2020-January/000252.htm...
[2]: https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration#M...
Bryan (on behalf of the Toolforge admins and the Cloud Services team)
I had been planning to switch my tools over before it was forced, so I took care of that last night - thanks, it seems to have gone smoothly. And I love the new Grafana dashboards!
One question - I seem to be getting more timeout-related 500 server errors. Was there a change in how that is handled (e.g. a reduced time limit for a response from the server)? I realize it's good practice to respond quickly, but some of my existing cases don't at the moment, and I'm hitting them occasionally.
Arthur
On Thu, Feb 20, 2020 at 7:09 PM Bryan Davis <bd808@wikimedia.org> wrote:
Actually, I am beginning to suspect the 500 server errors are caused by an out-of-memory condition. Do the new Kubernetes containers have lower memory limits than the old ones?
Arthur
On Fri, Feb 21, 2020 at 11:14 AM Arthur Smith <arthurpsmith@gmail.com> wrote:
On 2/23/20 8:51 PM, Arthur Smith wrote:
Actually I am beginning to suspect the 500 server errors are caused by an out-of-memory condition. Do the new kubernetes containers have lower memory usage limits than the old ones?
Yes, you are right:
https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration#L...
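If your tool needs more than the new defaults, you can request higher limits when starting the webservice. Roughly like this (double-check `webservice --help` for the exact flag names; "python3.7" is just an example type):

    $ webservice --backend=kubernetes --cpu 1 --mem 2Gi python3.7 start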
hope that helps.
regards.
Thanks, I bumped up the memory request when it starts, and that seems to have resolved the problem! Also, I hadn't realized that you need to repeat the same flags every time you run `webservice start` - I thought it retained the previous settings, but it looks like you have to be explicit each time or it falls back to the old defaults?
Arthur
On Mon, Feb 24, 2020 at 6:56 AM Arturo Borrero Gonzalez <aborrero@wikimedia.org> wrote:
On Tue, Feb 25, 2020 at 7:40 AM Arthur Smith <arthurpsmith@gmail.com> wrote:
Thanks, I bumped up the memory request when it starts and that seems to have resolved the problem! Also I hadn't realized that you need to repeat the same commands whenever you run webservice start - I thought it retained the previous settings, but it looks like you have to be specific each time or it falls back on the old defaults?
I agree that this is unintuitive, but it is the current behavior. The internal implementation of `webservice stop` clears all the state data from the `$HOME/service.manifest` file. `webservice restart` is the only command which uses that state data directly.
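In practice that means something like the following (the type and flags are just an illustration):

    $ webservice --mem 2Gi python3.7 start   # settings recorded in $HOME/service.manifest
    $ webservice restart                     # reuses the recorded settings
    $ webservice stop                        # clears the recorded state
    $ webservice python3.7 start             # default limits again unless you repeat the flags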
Bryan
On 2/21/20 5:14 PM, Arthur Smith wrote:
One question - I seem to be getting some more timeout-related 500 server errors. Was there a change in how that is handled somehow (i.e. reduced time limit for response from the server)? I realize it's good practice to respond quickly, just some of the existing cases don't at the moment and I'm hitting them occasionally.
There are at least 3 proxies involved in serving Toolforge webservice requests:
1) tool main front proxy (dynamicproxy) (http)
2) kubernetes front haproxy (tcp)
3) kubernetes nginx-ingress (http), and perhaps kube-proxy (tcp)
More information here: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Networking_and_in...
This is to say: yes, serving your requests as quickly as possible should help keep the various proxy connections from dying and keep things working smoothly.
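If you want a rough idea of how long a given request takes end to end through that chain, timing it with curl works; for example, against a placeholder tool URL:

    $ curl -s -o /dev/null -w '%{time_total}s\n' https://tools.wmflabs.org/mytool/slow-endpoint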
As of this email, we don't have any particular metrics or insight into proxy performance; this is something we could explore in the near future (for example, a dedicated Grafana dashboard).
regards.