TL;DR: The lighttpd webservice for https://tools.wmflabs.org/dplbot/ fails repeatedly, frequently, and unpredictably, and I have been unable to diagnose any cause.
Currently, tools.dplbot is running a php7.2 webservice on the kubernetes backend; however, the failures started occurring when it was running lighttpd on the job grid, and the move to kubernetes does not seem to have changed anything in this respect. The tool serves a variety of PHP-based pages which generate reports from the Toolforge database replicas.
The symptom of failure is that all requests get rejected with 503 Service Unavailable. The lighttpd process continues to run (which is why I am calling this a "failure" rather than a "crash"), so kubernetes doesn't detect any problem and doesn't restart the server, but the server does not respond to any requests. The "webservice status" command claims that the webservice is still running. Every time this happens, I have to restart the webservice. The webservice appears to fail immediately after some restarts, while in other cases it runs normally for a highly variable period of time (minutes to hours) before failing again.
Even more frustrating than the constant failures is the lack of any information for diagnosing the cause. The error.log file (/data/project/dplbot/error.log) does not show any error messages corresponding to the times of failures. I tried various lighttpd debugging options, and none of them gave me anything useful: they appear to show all requests being handled normally, and no debug information at all at or after the point of failure. I also reactivated access logging (/data/project/dplbot/access.log), and this only shows requests that were handled correctly. In other words, there is no log entry indicating a request that came in at/just before a failure without a corresponding response going out.
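For reference, these are the kinds of directives involved (a sketch for $HOME/.lighttpd.conf, which Toolforge merges into the tool's generated config; whether they would surface anything in this particular case is unknown). One lead worth noting: mod_fastcgi returns 503 once it has marked all of its PHP backends as dead, and fastcgi.debug logs those load-balancer decisions to error.log:

```conf
# Hypothetical additions to $HOME/.lighttpd.conf (Toolforge merges this
# file into the generated lighttpd config).
# Log each step of request handling, to see whether a request still
# reaches the FastCGI backend just before a 503:
debug.log-request-handling = "enable"
debug.log-condition-handling = "enable"
# Log mod_fastcgi load-balancer decisions, including when a PHP backend
# is marked dead/overloaded (the state that produces silent 503s):
fastcgi.debug = 1
```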
If these failures were being caused spontaneously by some problem in lighttpd or in the Toolforge infrastructure, I would expect other users to be affected by them, but that doesn't seem to be the case.
This has previously been reported at https://phabricator.wikimedia.org/T115231 (including more detail on the debug options I tried), where frankly I have received absolutely no assistance. I did receive one mildly helpful comment from bd808 on a related issue (https://phabricator.wikimedia.org/T218915), as follows:
> ... [It is] possible to have a Kubernetes powered webservice become unresponsive to client requests due to an internal deadlock or resource exhaustion issue in the application which does not also lead to a crash of the lighttpd process itself.
However, if there is an internal deadlock or resource exhaustion issue in the underlying PHP scripts, I would expect some error message in the logs, which isn't there. Also, during a recent interval when the server was up for a while, I took the time to click every single link on https://tools.wmflabs.org/dplbot/, and the server responded to every one of them, so there does not seem to be a fatal bug in any of the scripts (although this exercise revealed a few minor issues).
I'm not necessarily looking for someone to solve this problem for me (although that would be nice :-) ), but just some ideas about how to identify potential causes. Right now it is basically a black hole; no information whatsoever is coming out of the webserver at the point of failure, so I can make no progress.
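If nothing else, an external probe could at least pin down the exact moment the black-hole state begins; a hypothetical watchdog sketch (the URL and the restart hook are my assumptions, not anything that exists today), run from cron or similar:

```python
import urllib.request
import urllib.error

def is_healthy(url, timeout=10):
    """Return True if url answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError:
        return False  # the server answered, but with e.g. 503
    except (urllib.error.URLError, OSError):
        return False  # no answer at all

# Cron idea: if not is_healthy("https://tools.wmflabs.org/dplbot/"),
# log a timestamp and (when run on the bastion) run `webservice restart`.
```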
--
Russell Blau
russblau(a)imapmail.org
Hi all,
The parliament diagram tool (
https://tools.wmflabs.org/parliamentdiagram/parlitest.php ) is down.
Last time it happened was a week ago: I just restarted the webservice
like Alex did, but now it's down again and I'm at work, so I can't log
in for the next six hours or so. Can someone restart it for me?
Also, how can I find out why it keeps going down?
Thanks a million!
David
On Sat, 28 Dec 2019 at 13:16, Alex Monk <krenair(a)gmail.com> wrote:
>
> I doubt it's that, as the tools project didn't lose any exec instances in
this issue that I'm aware of.
> Anyway I started that tool up
>
> krenair@tools-sgebastion-07:~$ sudo become parliamentdiagram
> tools.parliamentdiagram@tools-sgebastion-07:~$ webservice status
> Your webservice is not running
> tools.parliamentdiagram@tools-sgebastion-07:~$ webservice start
> Starting webservice...
> tools.parliamentdiagram@tools-sgebastion-07:~$ webservice status
> Your webservice of type lighttpd is running
>
> On Sat, 28 Dec 2019 at 11:59, David Richfield <davidrichfield(a)gmail.com>
wrote:
>>
>> Hi!
>>
>> The parliament diagram tool (
https://tools.wmflabs.org/parliamentdiagram/parlitest.php) is down, and I'm
on holiday away from my computers. Is this due to this issue, and what
should I be doing about it?
I was poking around in /data/project/ just now, looking for examples of how other tools set up their Django apps. I was surprised (well, only a little) to discover that there are a few world-readable app.py files with their django_secrets embedded in them.
That's not a good idea, folks. Secrets should not be stored anywhere that's world-readable.
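A minimal defensive sketch (the path and helper name are mine, not from any tool I looked at): keep the secret in a mode-600 file and refuse to start if that file is readable by group or other:

```python
import os

def read_secret(path):
    """Load a Django SECRET_KEY (or similar) from a file, refusing
    group/other-readable files so a world-readable app.py never holds it."""
    if os.stat(path).st_mode & 0o077:
        raise RuntimeError(path + " is readable by group/other; chmod 600 it")
    with open(path) as fh:
        return fh.read().strip()

# In settings.py / app.py (path illustrative):
#     SECRET_KEY = read_secret("/data/project/mytool/secret_key")
```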
The Toolforge admins would like to invite all Toolforge Kubernetes
users to begin migration of their tools to the 2020 Kubernetes
cluster. Instructions for migration and other details are on Wikitech
[0].
Timeline:
* 2020-01-09: 2020 Kubernetes cluster available for beta testers on an
opt-in basis
* 2020-01-24: 2020 Kubernetes cluster general availability for
migration on an opt-in basis
* 2020-02-10: Automatic migration of remaining workloads from 2016
cluster to 2020 cluster by Toolforge admins
We announced beta testing for this new cluster on 2020-01-09 [1].
Since then more than 70 tools have migrated, with approximately 110
tools now using it [2]. The Toolforge admins have also fixed a few
small issues that our early testers noticed. We are now ready and
excited to have many more tools move their workloads from the legacy
Kubernetes cluster over to the new 2020 Kubernetes cluster.
Thanks to Legoktm, Magnus, and others who helped during the beta
testing phase by trying things out and reporting issues that they
found.
For most tools the migration requires a small number of manual steps [0]:
* webservice stop
* kubectl config use-context toolforge
* alias kubectl=/usr/bin/kubectl; echo "alias
kubectl=/usr/bin/kubectl" >> $HOME/.profile
* webservice --backend=kubernetes [TYPE] start
This could also be a good opportunity for tools to upgrade to newer
language runtimes such as php7.3 and python3.7. See the list on
Wikitech [3] for currently available types. When upgrading to a new
runtime, do not forget to rebuild Python virtual environments, NPM
packages, or Composer packages if you are using them as well.
[0]: https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration
[1]: https://lists.wikimedia.org/pipermail/cloud-announce/2020-January/000247.ht…
[2]: https://tools.wmflabs.org/k8s-status/
[3]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes#Available_con…
Bryan (on behalf of the Toolforge admins and the Cloud Services team)
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
_______________________________________________
Wikimedia Cloud Services announce mailing list
Cloud-announce(a)lists.wikimedia.org (formerly labs-announce(a)lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce
On Thursday, January 23 at 15:00 UTC we'll be temporarily removing multiple
hypervisors from service to replace failed hardware. This process will
require rebooting the following list of virtual machines as we migrate
workloads to different hypervisors.
Unfortunately, we had several hardware failures happen in a short time and
need to take action as soon as possible.
Full list of instances that will be rebooted during Thursday's maintenance
(shown as <project>: <instance name>):
antiharassment: antiharassment-web1
automation-framework: af-puppetmaster02
bastion: bastion-restricted-eqiad1-01
cloudinfra: cloud-cumin-01
cloudinfra: mx-out01
cyberbot: cyberbot-exec-01
deployment-prep: deployment-cache-upload05
deployment-prep: deployment-elastic05
deployment-prep: deployment-hadoop-test-1
deployment-prep: deployment-kafka-main-2
deployment-prep: deployment-mx02
deployment-prep: deployment-webperf11
extdist: extdist-04
fastcci: fastcci-worker2
hashtags: hashtags-prod
huggle: huggle-wl
iiab: medbox3-iiab
language: language-eg
language: language-mleb-master
library-upgrader: upgrader-05
lta-tracker: tracker1
mobile: apps-talk-pages
mobile: apps-team-tools
monitoring: thanos-be01
monitoring: thanos-prom01
mwoffliner: mwoffliner2
mwoffliner: mwoffliner3
mwstake: mwstake
ores: ores-web-06
packaging: builder01
petscan: petscan3
petscan: petscan4
phabricator: phab-tin
pluggableauth: pluggableauth-server
quarry: quarry-web-01
quarry: quarry-worker-02
reading-web-staging: readingwebstaging
search: wdsearch2
traffic: diffscan
utrs: utrs-database2
wcdo: wcdo
wikibase-registry: wbregistry-01
wikilabels: wikilabels-backups-01
wikitextexp: parsing-qa-01
wm-bot: wm-bot
- WMCS Team
Hi.
I'm doing a secret project 😉 for a presentation. I want to count edits
per namespace up until a certain date. There are about 200 users, so not
that many, but too many to do via the edit counter...
Using revision_actor_temp I get results much(!) faster, but I get a
different namespace (2 instead of 0).
https://quarry.wmflabs.org/query/41072
Longer query, but correct results:
https://quarry.wmflabs.org/query/24267
Is this a bug or a feature? 😉
Cheers,
Nux.
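For reference, a sketch of the faster revision_actor_temp variant, with the namespace taken from an explicit join on the page table (the user name and cutoff are placeholders, not my actual query):

```sql
-- revision_actor_temp has no namespace column of its own, so the
-- namespace must come from joining revactor_page to page.
SELECT p.page_namespace, COUNT(*) AS edits
FROM revision_actor_temp rat
JOIN page p  ON p.page_id  = rat.revactor_page
JOIN actor a ON a.actor_id = rat.revactor_actor
WHERE a.actor_name = 'ExampleUser'
  AND rat.revactor_timestamp < '20200101000000'
GROUP BY p.page_namespace;
```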
I just got the following error. I reran the command and it worked the second time.
My limited experience with etcd says this is not a good thing, so figured you'd want to know about it.
> tools.spi-tools@tools-sgebastion-08:~/www/python/src$ webservice --backend=kubernetes python3.7 restart
> Traceback (most recent call last):
> File "/usr/local/bin/webservice", line 318, in <module>
> if stop(job, ""):
> File "/usr/local/bin/webservice", line 142, in stop
> job.request_stop()
> File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 675, in request_stop
> self._delete_obj(pykube.Service, self.webservice_label_selector)
> File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 447, in _delete_obj
> o.delete()
> File "/usr/lib/python2.7/dist-packages/pykube/objects.py", line 96, in delete
> self.api.raise_for_status(r)
> File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 104, in raise_for_status
> raise HTTPError(payload["message"])
> pykube.exceptions.HTTPError: client: etcd member https://tools-k8s-etcd-01.tools.eqiad.wmflabs:2379 has no leader
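Since the command succeeded on the second try, a blunt client-side workaround is to retry on transient failures like this (a sketch, not part of webservice itself; the command and delays are illustrative):

```python
import subprocess
import time

def run_with_retries(cmd, attempts=3, delay=5):
    """Run cmd, retrying on non-zero exit (e.g. a transient
    'etcd member ... has no leader' error); return True on success."""
    for attempt in range(1, attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return True
        if attempt < attempts:
            time.sleep(delay)
    return False

# e.g. run_with_retries(
#     ["webservice", "--backend=kubernetes", "python3.7", "restart"])
```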
Together with the last Python 2 release in April 2020, the Pywikibot team
will release the **last version that supports Python 2**. We created a
**python2** tag marking that version, so you can continue running your
Python 2 scripts from this tag if you really need to.
After that version, Pywikibot will not receive any further patches or
bug fixes related to Python 2. Its code will be cleaned of Python
2-specific functions, patches, deprecations and other leftovers, so make
sure you use this tag if you still want to run Pywikibot under Python 2.
The Pywikibot team strongly recommends migrating your scripts to Python 3.
To do so, you can use the 2to3 script installed by default with Python
2.6+, see https://docs.python.org/2/library/2to3.html. You can also
simply try running your script under Python 3 (the "-simulate" parameter
can be handy) and fix any issues that come up. If you encounter problems
during the migration, you can always ask us here:
https://phabricator.wikimedia.org/T242120
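As a tiny illustration of what 2to3 does, and of one semantic change it does not handle (the filename is illustrative):

```python
# Python 2 source:
#     print "half:", 7 / 2        # Python 2 prints: half: 3 (floor division)
# After `2to3 -w script.py`, the print statement becomes a function call:
print("half:", 7 / 2)   # Python 3 prints: half: 3.5 (true division)
# Note: 2to3 does NOT change '/' semantics; use '//' where floor
# division was intended:
print("half:", 7 // 2)  # half: 3
```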
Best regards,
Martin Urbanec and Dvorapa
I am happy to announce that a new and improved Kubernetes cluster is
now available for use by beta testers on an opt-in basis. A page has
been created on Wikitech [0] outlining the self-service migration
process.
Timeline:
* 2020-01-09: 2020 Kubernetes cluster available for beta testers on an
opt-in basis
* 2020-01-23: 2020 Kubernetes cluster general availability for
migration on an opt-in basis
* 2020-02-10: Automatic migration of remaining workloads from 2016
cluster to 2020 cluster by Toolforge admins
This new cluster has been a work in progress for more than a year
within the Wikimedia Cloud Services team, and a top priority project
for the past six months. About 35 tools, including
https://tools.wmflabs.org/admin/, are currently running on what we are
calling the "2020 Kubernetes cluster". This new cluster is running
Kubernetes v1.15.6 and Docker 19.03.4. It is also using a newer
authentication and authorization method (RBAC), a new ingress routing
service, and a different method of integrating with the Developer
account LDAP service. We have built a new tool [1] which makes the
state of the Kubernetes cluster more transparent and on par with the
information that we already expose for the grid engine cluster [2].
With a significant number of tools managed by Toolforge administrators
already migrated to the new cluster, we are fairly confident that the
basic features used by most Kubernetes tools are covered. It is likely
that a few outlying issues remain to be found as more tools move, but
we have confidence that we can address them quickly. This has led us
to propose a fairly short period of voluntary beta testing, followed
by a short general availability opt-in migration period, and finally a
complete migration of all remaining tools, which will be done by the
Toolforge administration team for anyone who has not migrated
themselves.
Please help with beta testing if you have some time and are willing to
seek help on IRC, Phabricator, and the cloud(a)lists.wikimedia.org
mailing list for any early-adopter issues you may encounter.
I want to publicly praise Brooke Storm and Arturo Borrero González for
the hours that they have put into reading docs, building proof of
concept clusters, and improving automation and processes to make the
2020 Kubernetes cluster possible. The Toolforge community can look
forward to more frequent and less disruptive software upgrades in this
cluster as a direct result of this work. We have some other feature
improvements in planning now that I think you will all be excited to
see and use later this year!
[0]: https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration
[1]: https://tools.wmflabs.org/k8s-status/
[2]: https://tools.wmflabs.org/sge-status/
Bryan (on behalf of the Toolforge admins and the Cloud Services team)
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808