TL;DR: The lighttpd webservice for https://tools.wmflabs.org/dplbot/ fails repeatedly, frequently, and unpredictably, and I have been unable to diagnose any cause.
Currently, tools.dplbot is running a php7.2 webservice on the kubernetes backend; however, the failures started occurring when it was running lighttpd on the job grid, and the move to kubernetes does not seem to have changed anything in this respect. The tool serves a variety of PHP-based pages which generate reports from the Toolforge database replicas.
The symptom of failure is that all requests get rejected with 503 Service Unavailable. The lighttpd process continues to run (which is why I am calling this a "failure" rather than a "crash"), so kubernetes doesn't detect any problem and doesn't restart the server, but the server does not respond to any requests. The "webservice status" command claims that the webservice is still running. Every time this happens, I have to restart the webservice. The webservice appears to fail immediately after some restarts, while in other cases it runs normally for a highly variable period of time (minutes to hours) before failing again.
Even more frustrating than the constant failures is the lack of any information for diagnosing the cause. The error.log file (/data/project/dplbot/error.log) does not show any error messages corresponding to the times of failures. I tried various lighttpd debugging options, and none of them gave me anything useful: they appear to show all requests being handled normally, and no debug information at all at or after the point of failure. I also reactivated access logging (/data/project/dplbot/access.log), and this only shows requests that were handled correctly. In other words, there is no log entry indicating a request that came in at/just before a failure without a corresponding response going out.
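For reference, these are the kinds of directives involved (a sketch for $HOME/.lighttpd.conf, which Toolforge merges into the tool's generated config; whether they would surface anything in this particular case is unknown). One lead worth noting: mod_fastcgi returns 503 once it has marked all of its PHP backends as dead, and fastcgi.debug logs those load-balancer decisions to error.log:

```conf
# Hypothetical additions to $HOME/.lighttpd.conf (Toolforge merges this
# file into the generated lighttpd config).
# Log each step of request handling, to see whether a request still
# reaches the FastCGI backend just before a 503:
debug.log-request-handling = "enable"
debug.log-condition-handling = "enable"
# Log mod_fastcgi load-balancer decisions, including when a PHP backend
# is marked dead/overloaded (the state that produces silent 503s):
fastcgi.debug = 1
```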
If these failures were being caused spontaneously by some problem in lighttpd or in the Toolforge infrastructure, I would expect other users to be affected by them, but that doesn't seem to be the case.
This has previously been reported at https://phabricator.wikimedia.org/T115231 (including more detail on the debug options I tried), where frankly I have received absolutely no assistance. I did receive one mildly helpful comment from bd808 on a related issue (https://phabricator.wikimedia.org/T218915), as follows:
> ... [It is] possible to have a Kubernetes powered webservice become unresponsive to client requests due to an internal deadlock or resource exhaustion issue in the application which does not also lead to a crash of the lighttpd process itself.
However, if there is an internal deadlock or resource exhaustion issue in the underlying PHP scripts, I would expect some error message in the logs, which isn't there. Also, during a recent interval when the server was up for a while, I took the time to click every single link on https://tools.wmflabs.org/dplbot/, and the server responded to every one of them, so there does not seem to be a fatal bug in any of the scripts (although this exercise revealed a few minor issues).
I'm not necessarily looking for someone to solve this problem for me (although that would be nice :-) ), but just some ideas about how to identify potential causes. Right now it is basically a black hole; no information whatsoever is coming out of the webserver at the point of failure, so I can make no progress.
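If nothing else, an external probe could at least pin down the exact moment the black-hole state begins; a hypothetical watchdog sketch (the URL and the restart hook are my assumptions, not anything that exists today), run from cron or similar:

```python
import urllib.request
import urllib.error

def is_healthy(url, timeout=10):
    """Return True if url answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError:
        return False  # the server answered, but with e.g. 503
    except (urllib.error.URLError, OSError):
        return False  # no answer at all

# Cron idea: if not is_healthy("https://tools.wmflabs.org/dplbot/"),
# log a timestamp and (when run on the bastion) run `webservice restart`.
```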
--
Russell Blau
russblau(a)imapmail.org
Hi all,
The parliament diagram tool (
https://tools.wmflabs.org/parliamentdiagram/parlitest.php ) is down.
Last time it happened was a week ago: I just restarted the webservice
like Alex did, but now it's down again and I'm at work, so I can't log
in for the next six hours or so. Can someone restart it for me?
Also, how can I find out why it keeps going down?
Thanks a million!
David
On Sat, 28 Dec 2019 at 13:16, Alex Monk <krenair(a)gmail.com> wrote:
>
> I doubt it's that, as the tools project didn't lose any exec instances in
this issue that I'm aware of.
> Anyway I started that tool up
>
> krenair@tools-sgebastion-07:~$ sudo become parliamentdiagram
> tools.parliamentdiagram@tools-sgebastion-07:~$ webservice status
> Your webservice is not running
> tools.parliamentdiagram@tools-sgebastion-07:~$ webservice start
> Starting webservice...
> tools.parliamentdiagram@tools-sgebastion-07:~$ webservice status
> Your webservice of type lighttpd is running
>
> On Sat, 28 Dec 2019 at 11:59, David Richfield <davidrichfield(a)gmail.com>
wrote:
>>
>> Hi!
>>
>> The parliament diagram tool (
https://tools.wmflabs.org/parliamentdiagram/parlitest.php) is down, and I'm
on holiday away from my computers. Is this due to this issue, and what
should I be doing about it?
I was poking around in /data/project/ just now, looking for examples of how other tools set up their Django apps. I was surprised (well, only a little) to discover that there are a few world-readable app.py files with their django_secrets embedded in them.
That's not a good idea, folks. Secrets should not be stored anywhere that's world-readable.
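A minimal defensive sketch (the path and helper name are mine, not from any tool I looked at): keep the secret in a mode-600 file and refuse to start if that file is readable by group or other:

```python
import os

def read_secret(path):
    """Load a Django SECRET_KEY (or similar) from a file, refusing
    group/other-readable files so a world-readable app.py never holds it."""
    if os.stat(path).st_mode & 0o077:
        raise RuntimeError(path + " is readable by group/other; chmod 600 it")
    with open(path) as fh:
        return fh.read().strip()

# In settings.py / app.py (path illustrative):
#     SECRET_KEY = read_secret("/data/project/mytool/secret_key")
```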
The Toolforge admins would like to invite all Toolforge Kubernetes
users to begin migration of their tools to the 2020 Kubernetes
cluster. Instructions for migration and other details are on Wikitech
[0].
Timeline:
* 2020-01-09: 2020 Kubernetes cluster available for beta testers on an
opt-in basis
* 2020-01-24: 2020 Kubernetes cluster general availability for
migration on an opt-in basis
* 2020-02-10: Automatic migration of remaining workloads from 2016
cluster to 2020 cluster by Toolforge admins
We announced beta testing for this new cluster on 2020-01-09 [1].
Since then more than 70 tools have migrated, with approximately 110
tools now using it [2]. The Toolforge admins have also fixed a few
small issues that our early testers noticed. We are now ready and
excited to have many more tools move their workloads from the legacy
Kubernetes cluster over to the new 2020 Kubernetes cluster.
Thanks to Legoktm, Magnus, and others who helped during the beta
testing phase by trying things out and reporting issues that they
found.
For most tools the migration requires a small number of manual steps [0]:
* webservice stop
* kubectl config use-context toolforge
* alias kubectl=/usr/bin/kubectl; echo "alias
kubectl=/usr/bin/kubectl" >> $HOME/.profile
* webservice --backend=kubernetes [TYPE] start
This could also be a good opportunity for tools to upgrade to newer
language runtimes such as php7.3 and python3.7. See the list on
Wikitech [3] for currently available types. When upgrading to a new
runtime, do not forget to rebuild Python virtual environments, NPM
packages, or Composer packages if you are using them as well.
[0]: https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration
[1]: https://lists.wikimedia.org/pipermail/cloud-announce/2020-January/000247.ht…
[2]: https://tools.wmflabs.org/k8s-status/
[3]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes#Available_con…
Bryan (on behalf of the Toolforge admins and the Cloud Services team)
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
_______________________________________________
Wikimedia Cloud Services announce mailing list
Cloud-announce(a)lists.wikimedia.org (formerly labs-announce(a)lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce
On Thursday, January 23 at 15:00 UTC we'll be temporarily removing multiple
hypervisors from service to replace failed hardware. This process will
require rebooting the following list of virtual machines as we migrate
workloads to different hypervisors.
Unfortunately, we had several hardware failures happen in a short time and
need to take action as soon as possible.
Full list of instances that will be rebooted during Thursday's maintenance
(shown as <project>: <instance name>):
antiharassment: antiharassment-web1
automation-framework: af-puppetmaster02
bastion: bastion-restricted-eqiad1-01
cloudinfra: cloud-cumin-01
cloudinfra: mx-out01
cyberbot: cyberbot-exec-01
deployment-prep: deployment-cache-upload05
deployment-prep: deployment-elastic05
deployment-prep: deployment-hadoop-test-1
deployment-prep: deployment-kafka-main-2
deployment-prep: deployment-mx02
deployment-prep: deployment-webperf11
extdist: extdist-04
fastcci: fastcci-worker2
hashtags: hashtags-prod
huggle: huggle-wl
iiab: medbox3-iiab
language: language-eg
language: language-mleb-master
library-upgrader: upgrader-05
lta-tracker: tracker1
mobile: apps-talk-pages
mobile: apps-team-tools
monitoring: thanos-be01
monitoring: thanos-prom01
mwoffliner: mwoffliner2
mwoffliner: mwoffliner3
mwstake: mwstake
ores: ores-web-06
packaging: builder01
petscan: petscan3
petscan: petscan4
phabricator: phab-tin
pluggableauth: pluggableauth-server
quarry: quarry-web-01
quarry: quarry-worker-02
reading-web-staging: readingwebstaging
search: wdsearch2
traffic: diffscan
utrs: utrs-database2
wcdo: wcdo
wikibase-registry: wbregistry-01
wikilabels: wikilabels-backups-01
wikitextexp: parsing-qa-01
wm-bot: wm-bot
- WMCS Team
Hi.
I'm doing a secret project 😉 for a presentation. I want to count edits
per namespace up until a certain date. There are about 200 users, so not
that many, but too many to do via the edit counter...
Using revision_actor_temp I get results much(!) faster, but I get a
different namespace (2 instead of 0).
https://quarry.wmflabs.org/query/41072
Longer query, but correct results:
https://quarry.wmflabs.org/query/24267
Is this a bug or a feature? 😉
Cheers,
Nux.
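For reference, a sketch of the faster revision_actor_temp variant, with the namespace taken from an explicit join on the page table (the user name and cutoff are placeholders, not my actual query):

```sql
-- revision_actor_temp has no namespace column of its own, so the
-- namespace must come from joining revactor_page to page.
SELECT p.page_namespace, COUNT(*) AS edits
FROM revision_actor_temp rat
JOIN page p  ON p.page_id  = rat.revactor_page
JOIN actor a ON a.actor_id = rat.revactor_actor
WHERE a.actor_name = 'ExampleUser'
  AND rat.revactor_timestamp < '20200101000000'
GROUP BY p.page_namespace;
```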
I just got the following error. I reran the command and it worked the second time.
My limited experience with etcd says this is not a good thing, so figured you'd want to know about it.
> tools.spi-tools@tools-sgebastion-08:~/www/python/src$ webservice --backend=kubernetes python3.7 restart
> Traceback (most recent call last):
> File "/usr/local/bin/webservice", line 318, in <module>
> if stop(job, ""):
> File "/usr/local/bin/webservice", line 142, in stop
> job.request_stop()
> File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 675, in request_stop
> self._delete_obj(pykube.Service, self.webservice_label_selector)
> File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 447, in _delete_obj
> o.delete()
> File "/usr/lib/python2.7/dist-packages/pykube/objects.py", line 96, in delete
> self.api.raise_for_status(r)
> File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 104, in raise_for_status
> raise HTTPError(payload["message"])
> pykube.exceptions.HTTPError: client: etcd member https://tools-k8s-etcd-01.tools.eqiad.wmflabs:2379 has no leader
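Since the command succeeded on the second try, a blunt client-side workaround is to retry on transient failures like this (a sketch, not part of webservice itself; the command and delays are illustrative):

```python
import subprocess
import time

def run_with_retries(cmd, attempts=3, delay=5):
    """Run cmd, retrying on non-zero exit (e.g. a transient
    'etcd member ... has no leader' error); return True on success."""
    for attempt in range(1, attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return True
        if attempt < attempts:
            time.sleep(delay)
    return False

# e.g. run_with_retries(
#     ["webservice", "--backend=kubernetes", "python3.7", "restart"])
```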
Together with the last Python 2 release in April 2020, the Pywikibot team
will release the **last version that supports Python 2**. We created a
**python2** tag marking that version, so you can continue running your
Python 2 scripts from this tag if you really need to.
After that version, Pywikibot will not receive any further patches or
bug fixes related to Python 2. Its code will be cleaned of Python
2-specific functions, patches, deprecations and other leftovers, so make
sure you use this tag if you still want to run Pywikibot under Python 2.
The Pywikibot team strongly recommends migrating your scripts to Python 3.
To do so, you can use the 2to3 script installed by default with Python
2.6+, see https://docs.python.org/2/library/2to3.html. You can also
simply try running your script under Python 3 (the "-simulate" parameter
can be handy) and fix any issues that come up. If you encounter problems
during the migration, you can always ask us here:
https://phabricator.wikimedia.org/T242120
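As a tiny illustration of what 2to3 does, and of one semantic change it does not handle (the filename is illustrative):

```python
# Python 2 source:
#     print "half:", 7 / 2        # Python 2 prints: half: 3 (floor division)
# After `2to3 -w script.py`, the print statement becomes a function call:
print("half:", 7 / 2)   # Python 3 prints: half: 3.5 (true division)
# Note: 2to3 does NOT change '/' semantics; use '//' where floor
# division was intended:
print("half:", 7 // 2)  # half: 3
```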
Best regards,
Martin Urbanec and Dvorapa
I am happy to announce that a new and improved Kubernetes cluster is
now available for use by beta testers on an opt-in basis. A page has
been created on Wikitech [0] outlining the self-service migration
process.
Timeline:
* 2020-01-09: 2020 Kubernetes cluster available for beta testers on an
opt-in basis
* 2020-01-23: 2020 Kubernetes cluster general availability for
migration on an opt-in basis
* 2020-02-10: Automatic migration of remaining workloads from 2016
cluster to 2020 cluster by Toolforge admins
This new cluster has been a work in progress for more than a year
within the Wikimedia Cloud Services team, and a top priority project
for the past six months. About 35 tools, including
https://tools.wmflabs.org/admin/, are currently running on what we are
calling the "2020 Kubernetes cluster". This new cluster is running
Kubernetes v1.15.6 and Docker 19.03.4. It is also using a newer
authentication and authorization method (RBAC), a new ingress routing
service, and a different method of integrating with the Developer
account LDAP service. We have built a new tool [1] which makes the
state of the Kubernetes cluster more transparent and on par with the
information that we already expose for the grid engine cluster [2].
With a significant number of tools managed by Toolforge administrators
already migrated to the new cluster, we are fairly confident that the
basic features used by most Kubernetes tools are covered. It is likely
that a few outlying issues remain to be found as more tools move, but
we have confidence that we can address them quickly. This has led us
to propose a fairly short period of voluntary beta testing, followed
by a short general availability opt-in migration period, and finally a
complete migration of all remaining tools, which will be done by the
Toolforge administration team for anyone who has not migrated
themselves.
Please help with beta testing if you have some time and are willing to
seek help on IRC, Phabricator, and the cloud(a)lists.wikimedia.org
mailing list for any early-adopter issues you may encounter.
I want to publicly praise Brooke Storm and Arturo Borrero González for
the hours that they have put into reading docs, building proof of
concept clusters, and improving automation and processes to make the
2020 Kubernetes cluster possible. The Toolforge community can look
forward to more frequent and less disruptive software upgrades in this
cluster as a direct result of this work. We have some other feature
improvements in planning now that I think you will all be excited to
see and use later this year!
[0]: https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration
[1]: https://tools.wmflabs.org/k8s-status/
[2]: https://tools.wmflabs.org/sge-status/
Bryan (on behalf of the Toolforge admins and the Cloud Services team)
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808