Hi there,
Toolforge is a complex service. There are many moving parts and there
are always several people working on different pieces of it.
We have been holding informal Toolforge-specific meetings from time to
time to unblock some decisions or to get everyone on the same page.
The proposal is to create a monthly 1h Toolforge engineering-focused
meeting called "Toolforge council".
This meeting would be open in nature, including:
* The WMCS/TE team
* Toolforge community root group members [0]
* Other interested parties, invited as required
The notes and results of the meeting will be published somewhere on
Wikitech and perhaps on this very mailing list.
The next two meetings of this kind will be:
* 2022-11-08 at 15:00 UTC
* 2022-12-13 at 15:00 UTC
For these next two, I will facilitate/moderate them, as well as
collect/share some agenda points beforehand.
I would like to avoid formalizing any other protocols regarding the
meeting beyond what is contained in this email. It is already an
evolution of the informal approach we have been using. Let's see how it
evolves organically.
Comments welcome (including on the name, hehe).
[0]
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#What_makes_a_roo…
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Continuing from my post on cloud@...
On Thu, Oct 6, 2022 at 6:21 PM Bryan Davis <bd808(a)wikimedia.org> wrote:
>
> On Thu, Oct 6, 2022 at 5:39 AM Taavi Väänänen <hi(a)taavi.wtf> wrote:
> >
> > In general, I feel that over the last few months,
> > quite a lot of planning and progress reporting has moved from our
> > various public channels (most notably Phabricator and -cloud-admin on
> > IRC) to private ones. I don't particularly like this trend.
>
> I did a thing in my late afternoon yesterday that may have aggravated
> Taavi's feelings of being left out of decision loops.
>
> I made a decision without consulting any other Toolforge admins to add
> about 300MiB of fonts to the php7.4 Docker image available for use on
> Toolforge [0]. This decision reversed my prior blocking of this exact
> same request in 2019 [1]. It also goes against at least as many years
> of the Toolforge admins telling the Toolforge member community that we
> do not "bloat" the Kubernetes containers with specialty features for a
> small number of use cases. This reversal will complicate future
> decisions on such issues by introducing this easily seen counter
> example. I acted with good intent in the moment, but I did not act
> with good judgement nor consideration of my partners in maintaining
> the Toolforge infrastructure. For that I am truly sorry.
>
> I would also like to apologize for treating what I was doing as
> "urgent" when it could have easily waited for a discussion with others
> either in code review or in other forums. This false urgency was
> counter to what I know to be the best way to treat technical decisions
> and it was disrespectful of my co-admins in the Toolforge environment.
>
> I would also like to have a conversation among the Toolforge admins
> about how to best deal with this decision going forward. That
> conversation is probably better had on Phabricator or the cloud-admin
> mailing list than here, but it should happen and it should result in
> either reverting the change that I made or jointly creating updated
> guidelines for what is and is not acceptable in the shared Kubernetes
> containers while we await better methods of managing per-tool feature
> differences.
>
> [0]: https://phabricator.wikimedia.org/T310435#8288848
> [1]: https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+…
For the fonts themselves, should we:
* Revert the change and tell svgtranslate to move back to the grid?
* Propagate the change outward by making the same/similar change to
all php images?
* Propagate the change outward by making the same/similar change to
all base images?
* Let it be?
For the bigger picture of breaking our long-held stance on "bloat", I
would like to hear suggestions from y'all. If the font decision is to
revert, then maybe there is nothing to talk about here. If the fonts
stay, then I think there is a need either to document this as a rogue
action that has been allowed to stand, and which should not set a
precedent for the future, or to come up with a rubric for what is
allowed and why.
I am also open to hearing from anyone on or off list who feels that I
need to make additional amends to the Toolforge admins, the Toolforge
user community, or any particular individuals. I really didn't mean to
make a mess, but I did and I would like to work towards correcting
that as much as possible.
Bryan
PS I will be out of office until 2022-10-11, but I will try to check
in on this thread in the intervening days.
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
Hi there,
Just today I introduced the $::wmcs_project var to replace $::labsproject:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/849050
Please use the modern one.
Perhaps I should try to nerd-snipe someone to see if we can have the
linter reject new patches with the old variable.
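If someone does pick that up, a quick standalone pre-merge check (rather
than a proper puppet-lint plugin) could look roughly like the Python
sketch below; the target branch name "production" and the CI wiring are
assumptions, not something that exists today:

    #!/usr/bin/env python3
    # Hypothetical check: fail if a proposed patch adds lines that still
    # use the deprecated $::labsproject variable. Assumes the patch is
    # checked out on top of the "production" branch.
    import subprocess
    import sys

    DEPRECATED = "$::labsproject"
    REPLACEMENT = "$::wmcs_project"

    def added_lines():
        """Return the lines added by the patch under review."""
        diff = subprocess.run(
            ["git", "diff", "--unified=0", "origin/production...HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [l[1:] for l in diff.splitlines()
                if l.startswith("+") and not l.startswith("+++")]

    def main():
        bad = [l for l in added_lines() if DEPRECATED in l]
        if bad:
            print(f"{DEPRECATED} is deprecated, please use {REPLACEMENT}:")
            for line in bad:
                print(f"  {line.strip()}")
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())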
regards.
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi cloud-admin@,
The recent cloud@ thread made me realize that I should probably keep
everyone else more up to date on the infrastructure-level projects I'm
working on by myself. So I've tried to summarize the major recent and
upcoming changes I'm working on below, in semi-random order.
Please let me know if you find this useful or interesting (or if you
don't, that helps to know too). Questions and comments are also welcome.
Terraform
I sent Puppet patches[0] to enable application credential authentication
in Keystone to let arbitrary clients speak to the OpenStack APIs. I
believe Andrew is working on the firewall rules and related HAProxy
config to open up the APIs to the public as a part of the
Cumin/Spicerack work going on at the moment.
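As a rough illustration of what that unlocks for external clients, a
Python sketch (the auth URL and the credential values are placeholders,
not the real endpoint):

    # Sketch: authenticate to Keystone with an application credential
    # and reuse the session with a regular OpenStack client.
    from keystoneauth1.identity import v3
    from keystoneauth1.session import Session
    from novaclient import client as nova_client

    auth = v3.ApplicationCredential(
        auth_url="https://keystone.example.org:5000/v3",  # placeholder
        application_credential_id="<credential id>",
        application_credential_secret="<credential secret>",
    )
    session = Session(auth=auth)

    nova = nova_client.Client("2.1", session=session)
    for server in nova.servers.list():
        print(server.name, server.status)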
I tagged the initial version of the custom terraform-cloudvps
Terraform provider.[1] The provider is designed to supplement the
'official' OpenStack provider and currently lets you interact with the
web proxy API using the new go-cloudvps library[2], with Puppet ENC
support next up on my Terraform TODO list.
There's also a Puppet patch[3] pending to configure a self-hosted
Terraform registry on terraform.wmcloud.org. It's cherry-picked to the
project puppet master, but having it actually merged would be nice.
[0]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/840121
[1]: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/terraform-cloudvps
[2]: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/go-cloudvps
[3]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/834344
CloudVPS web proxy
Planning to do some work to make the proxy service more reliable in
case of node failure. Also planned is moving the current SQLite database
to the cloudinfra MariaDB cluster for reliability and easier failover.
There are a few Puppet patches prepping for this pending review,
starting from [4].
[4]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/831041
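For the database move itself, the basic shape would be a one-off copy of
the existing proxy records, roughly like the sketch below; the file path,
host, table and column names are made up for illustration only (the real
schema is whatever the proxy service currently uses):

    # Illustrative one-off migration; path and schema are hypothetical.
    import sqlite3
    import pymysql

    src = sqlite3.connect("/path/to/proxy.db")  # hypothetical path
    dst = pymysql.connect(host="cloudinfra-db.example",  # placeholder
                          user="proxydb", password="...",
                          database="proxydb")

    rows = src.execute("SELECT domain, backend FROM routes").fetchall()
    with dst.cursor() as cur:
        cur.executemany(
            "INSERT INTO routes (domain, backend) VALUES (%s, %s)",
            rows,
        )
    dst.commit()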
Toolforge
Sent a few patches to the jobs-framework-* repositories. Planning to do
a bit more cleanup here, to hopefully make the grid migration easier.
I'd like to introduce a new k8s utility, kube-container-updater[5], to
automatically restart long-running containers that are running outdated
images.
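To make that idea a bit more concrete, here is a very rough sketch of
the kind of loop I have in mind; it is not the actual code in [5], and
the label selector is just an example:

    # Rough sketch, not the real kube-container-updater implementation.
    import requests
    from kubernetes import client, config

    def registry_digest(image):
        """Ask the registry (Docker Registry HTTP API v2) which digest
        the given tag currently points to. Assumes images are referenced
        with an explicit registry host."""
        registry, _, rest = image.partition("/")
        repo, _, tag = rest.partition(":")
        accept = "application/vnd.docker.distribution.manifest.v2+json"
        resp = requests.head(
            f"https://{registry}/v2/{repo}/manifests/{tag or 'latest'}",
            headers={"Accept": accept},
            timeout=10,
        )
        return resp.headers.get("Docker-Content-Digest")

    def restart_outdated(label_selector="app=example"):
        """Delete pods whose running image digest no longer matches the
        registry, letting their controllers recreate them."""
        config.load_incluster_config()
        core = client.CoreV1Api()
        pods = core.list_pod_for_all_namespaces(
            label_selector=label_selector)
        for pod in pods.items:
            for status in pod.status.container_statuses or []:
                latest = registry_digest(status.image)
                # status.image_id holds the digest actually running.
                if latest and latest not in (status.image_id or ""):
                    core.delete_namespaced_pod(pod.metadata.name,
                                               pod.metadata.namespace)
                    break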
Upgrading to Kubernetes 1.22 is only blocked on dealing with certificate
generation for the custom webhooks[6]. For this, I'd like to get
feedback on the approach (continue to manually sign certificates or
introduce cert-manager to automate that). Looking further ahead at the
k8s versions, 1.23 will be fairly simple I think, and 1.24 will require
migrating the cluster from Docker to containerd, which I'd like to pair
with a bullseye upgrade.
Once we have an object storage service I'd like to look a bit more into
providing a logging solution that doesn't use NFS.
[5]:
https://gerrit.wikimedia.org/r/c/cloud/toolforge/kube-container-updater/+/8…
[6]: https://phabricator.wikimedia.org/T286856
metricsinfra
No recent development here. I think we could roll out Prometheus
scraping to all projects and instances with the current infra, but for
that someone would need to sort out how to deal with security groups
given the pull model Prometheus uses. Some discussion about this is in
Phabricator[7].
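For reference, the kind of thing that would need automating per project
is roughly the following (a sketch with openstacksdk; the port and the
source range are examples, not a proposal):

    # Sketch: open the node-exporter port to the metricsinfra scrapers
    # in a project's default security group. Values are examples only.
    import openstack

    conn = openstack.connect(cloud="someproject")  # clouds.yaml entry

    sg = conn.network.find_security_group("default")
    conn.network.create_security_group_rule(
        security_group_id=sg.id,
        direction="ingress",
        ethertype="IPv4",
        protocol="tcp",
        port_range_min=9100,  # prometheus-node-exporter
        port_range_max=9100,
        remote_ip_prefix="172.16.0.0/21",  # placeholder for scraper IPs
    )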
The second thing on the metricsinfra roadmap is building an API to let
projects manage their scraping rules and alerts. I'd like to integrate
that with Terraform at some point.
[7]: https://phabricator.wikimedia.org/T288108
Puppet ENC service
Planning to do some work[8] on the ENC API service, mostly to make it
work with Terraform. Most notably the Git integration will be moved from
the Horizon dashboard to the API service itself.
[8]: https://phabricator.wikimedia.org/T317478
ToolsDB
No recent developments here either. :(
Hi there,
We are currently working on replacing older hardware servers with newer
ones, in particular those dedicated to cloud networking [0].
We have discovered a few shortcomings, mostly related to network
interface naming on the newer servers, the latest OpenStack version
behaving differently than it used to, and some base operating system
(Debian) bugs [1]. Some of these are hardware-dependent and difficult to
reproduce/anticipate in our staging environment.
The result is that we are having a more challenging and noisier
migration than we would like. We have already had a few (brief) network
outages while trying to bring the new servers into service.
We'll try to keep things as stable as possible over the next few days
until the migration is completed, but we can't rule out some more
(brief) network outages until we are safely on the other side of the
transition.
I'll send another note when this network maintenance is over.
regards.
[0] https://phabricator.wikimedia.org/T316284
[1] https://bugs.debian.org/989162
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation