Following a beta testing period [0] and a general use self-migration period [1], the Toolforge administration team is ready to begin the final phase of automatic migration of tools currently running on the legacy Kubernetes cluster to the 2020 Kubernetes cluster.
The migration process will involve Toolforge administrators running `webservice migrate` for each tool, in the same way that self-migration happens [2]. A small number of tools use the legacy Kubernetes cluster outside of the `webservice` system; these tools will be moved through a more manual process after all webservices have been migrated. We currently plan to do these migrations in several batches so that we can monitor the load and capacity of the 2020 Kubernetes cluster as we move ~640 more tools over from the legacy cluster.
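For tool maintainers who would rather self-migrate before their batch is processed, the process is roughly the following (see [2] for the full instructions; "mytool" here is just a placeholder tool name):

    $ become mytool
    $ webservice migrate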
Once the tools have all been moved to the 2020 cluster, we will continue with additional cleanup and default configuration changes that will allow us to fully decommission the legacy cluster. We will also be updating various documentation on Wikitech during this final phase. We hope to complete the entire process by 2020-03-06 at the latest.
[0]: https://lists.wikimedia.org/pipermail/cloud-announce/2020-January/000247.htm...
[1]: https://lists.wikimedia.org/pipermail/cloud-announce/2020-January/000252.htm...
[2]: https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration#M...
Bryan (on behalf of the Toolforge admins and the Cloud Services team)
I had been planning to switch my tools over before it was forced, so I took care of that last night - thanks, it seems to have gone smoothly. And I love the new Grafana dashboards!
One question - I seem to be getting more timeout-related 500 server errors. Was there a change in how that is handled (e.g. a reduced time limit for a response from the server)? I realize it's good practice to respond quickly, but some of my existing cases don't at the moment, and I'm hitting them occasionally.
Arthur
On Thu, Feb 20, 2020 at 7:09 PM Bryan Davis <bd808@wikimedia.org> wrote:
Actually, I am beginning to suspect the 500 server errors are caused by an out-of-memory condition. Do the new Kubernetes containers have lower memory limits than the old ones?
Arthur
On Fri, Feb 21, 2020 at 11:14 AM Arthur Smith <arthurpsmith@gmail.com> wrote:
On 2/23/20 8:51 PM, Arthur Smith wrote:
Actually I am beginning to suspect the 500 server errors are caused by an out-of-memory condition. Do the new kubernetes containers have lower memory usage limits than the old ones?
Yes, you are right:
https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration#L...
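If your tool needs more than the new defaults, you can request higher limits when starting the webservice. Roughly like this (double-check `webservice --help` for the exact flag names; "python3.7" is just an example type):

    $ webservice --backend=kubernetes --cpu 1 --mem 2Gi python3.7 start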
hope that helps.
regards.
Thanks, I bumped up the memory request when it starts, and that seems to have resolved the problem! Also, I hadn't realized that you need to repeat the same flags every time you run `webservice start` - I thought it retained the previous settings, but it looks like you have to be explicit each time or it falls back to the old defaults?
Arthur
On Mon, Feb 24, 2020 at 6:56 AM Arturo Borrero Gonzalez <aborrero@wikimedia.org> wrote:
On Tue, Feb 25, 2020 at 7:40 AM Arthur Smith <arthurpsmith@gmail.com> wrote:
Thanks, I bumped up the memory request when it starts and that seems to have resolved the problem! Also I hadn't realized that you need to repeat the same commands whenever you run webservice start - I thought it retained the previous settings, but it looks like you have to be specific each time or it falls back on the old defaults?
I agree that this is unintuitive, but it is the current behavior. The internal implementation of `webservice stop` clears all the state data from the `$HOME/service.manifest` file. `webservice restart` is the only command which uses that state data directly.
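In practice that means something like the following (the type and flags are just an illustration):

    $ webservice --mem 2Gi python3.7 start   # settings recorded in $HOME/service.manifest
    $ webservice restart                     # reuses the recorded settings
    $ webservice stop                        # clears the recorded state
    $ webservice python3.7 start             # default limits again unless you repeat the flags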
Bryan
On 2/21/20 5:14 PM, Arthur Smith wrote:
One question - I seem to be getting some more timeout-related 500 server errors. Was there a change in how that is handled somehow (i.e. reduced time limit for response from the server)? I realize it's good practice to respond quickly, just some of the existing cases don't at the moment and I'm hitting them occasionally.
There are at least 3 proxies involved in serving Toolforge webservice requests:
1) tool main front proxy (dynamicproxy) (http)
2) kubernetes front haproxy (tcp)
3) kubernetes nginx-ingress (http), and perhaps kube-proxy (tcp)
More information here: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Networking_and_in...
This is to say: yes, serving your requests as quickly as possible should help keep the various proxy connections from dying and keep things working smoothly.
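If you want a rough idea of how long a given request takes end to end through that chain, timing it with curl works; for example, against a placeholder tool URL:

    $ curl -s -o /dev/null -w '%{time_total}s\n' https://tools.wmflabs.org/mytool/slow-endpoint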
As of this email, we don't have any particular metrics or insight into proxy performance; this is something we could explore in the near future (for example, a dedicated Grafana dashboard).
regards.