Hi,
We just enabled email rate limiting on our MTA server [0] in Toolforge.
Please report any problems or issues you find related to this.
The current limit is 100 messages per hour per sender address. We may tune the
value as we observe the behavior of the system and the users.
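For the curious, here is a minimal sketch of how such a per-sender limit
can be expressed, assuming an Exim-style MTA; the actual configuration in
use may differ:

  # Hypothetical Exim ACL fragment; the real Toolforge config may differ.
  deny message   = Rate limit of 100 messages/hour per sender exceeded
       ratelimit = 100 / 1h / strict / $sender_address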
regards.
[0] https://en.wikipedia.org/wiki/Message_transfer_agent
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
Tomorrow (June 11th) at 16:00 UTC, we will be failing over the primary
NFS server to do maintenance and upgrades on it. The secondary partner
in the cluster is already upgraded and ready, and recent changes
*should* make it a fairly straightforward failover with a brief period
of high load. If it doesn't proceed smoothly, there will be a slightly
longer period of high load and an NFS lockup while the failover completes
(10-20 minutes or so). After the maintenance it will be failed back, which
will also, hopefully, be quick and painless.
--
Brooke Storm
SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_
Hello!
Next week we'll be rebuilding and upgrading the hardware that provides
DNS service to cloud-vps and toolforge. These rebuilds will start at
14:00 UTC and the whole process may take 2-3 hours. It's likely that DNS
lookups will be somewhat slower as clients fail over between the server
being rebuilt and the working one. In theory there should be few other
user-facing effects from these upgrades.
In practice, though, this isn't something that we've done for quite a
while, and touching DNS is always risky since it underlies pretty much
everything. Here are some things to be ready for:
- As a precaution we'll be disabling Horizon during the window to
prevent new VMs or DNS changes from landing in an inconsistent state.
- Some badly-behaved DNS clients won't fail over properly and will
report errors when their primary DNS server is down (see the resolver
sketch after this list).
- Puppet will almost certainly experience transient failures, since
Puppet is known to be one of those badly-behaved clients.
- If things go very badly there may be periods of total DNS outage which
will result in many WMCS-hosted services failing. There's no particular
reason that this /should/ happen, but this is the worst-case scenario.
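Regarding resolver failover: on instances where you control the resolver
configuration, the standard glibc options can make failover snappier. A
sketch of /etc/resolv.conf with placeholder addresses (note that on most
Cloud VPS instances this file is managed by Puppet):

  # Placeholder nameserver addresses; use your project's actual resolvers.
  options timeout:1 attempts:2 rotate
  nameserver 203.0.113.53
  nameserver 203.0.113.54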
For additional context, the phabricator task for this work is
https://phabricator.wikimedia.org/T253780
- Andrew + the WMCS team
As the last release of Python 2 is finally out, the July release of
Pywikibot is going to be the **last release that supports Python 2**.
Support for Python 3.4 and for MediaWiki versions older than 1.19 is also
going to be dropped. After this release, Pywikibot will not receive any
further patches or bug fixes related to those Python and MediaWiki
versions. Functions and other code specific to Python 3.4, Python 2.x, or
MediaWiki older than 1.19 will be removed.
For your convenience, this release is marked with a "python2" git tag,
and it is also the last 3.0.x release. In case you really need it, the
Pywikibot team has created the /shared/pywikibot/core_python2 repository
in Toolforge and a python2-pywikibot package in the software repositories
of some operating systems.
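If you really must stay on Python 2 for a short while, a hedged sketch of
pinning a checkout to that tag (the URL is the usual Gerrit location of
the core repository; adjust as needed):

  git clone https://gerrit.wikimedia.org/r/pywikibot/core.git
  cd core
  git checkout python2   # tag marking the last Python 2 compatible release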
The Pywikibot team strongly recommends that you migrate your scripts from
Python 2 to Python 3. The migration steps were described in the previous
message, which can be found here:
https://lists.wikimedia.org/pipermail/pywikibot/2020-January/009976.html
A detailed plan for the Python 2 deprecation, with dates, is described here:
https://www.mediawiki.org/wiki/Manual:Pywikibot/Compatibility
If you encounter any problems with the migration, you can always ask us
here: https://phabricator.wikimedia.org/T242120
Best regards,
Pywikibot team
At 2020-06-04T11:12 UTC a change was merged to the
operations/puppet.git repository which resulted in data loss for Cloud
VPS projects using a local Puppetmaster
(role::puppetmaster::standalone). The specific data lost was any commits
made locally on the Puppetmaster instance and overlaid on the upstream
labs/private.git repository. These patches would have contained
passwords, SSH keys, TLS certificates, and similar authentication
information for Puppet-managed configuration.
The majority of Cloud VPS projects are not affected by this
configuration data loss. Several highly used and visible projects,
including Toolforge (tools) and Beta Cluster (deployment-prep), are
affected. We have disabled Puppet across all Cloud VPS instances that
were reachable by our central command and control service (cumin) and
are currently evaluating impact and recovering data from
/var/log/puppet.log change logs where available.
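If you run a standalone Puppetmaster, you can check whether it carried
local-only commits; a sketch, assuming the repository lives at the common
path (verify the path on your instance before relying on this):

  # Path is an assumption; adjust for your puppetmaster.
  cd /var/lib/git/labs/private
  git log --oneline origin/master..HEAD   # local commits not in upstream
  git reflog                              # may still point at lost commits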
More information will be collected at
<https://phabricator.wikimedia.org/T254491> and an incident report
will also be prepared once the initial response is complete.
Bryan
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
On Thursday we will be upgrading the network infrastructure that routes
traffic for all cloud-vps and toolforge networks. This will
involve at least one failover between hosts, which will interrupt
existing network connections.
There should not be a prolonged network outage, but some connections
will be reset. If your tools or services are not resilient to
unexpected disconnections, you may need to manually restart services
after the updates are complete.
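For Toolforge tools, restarting a webservice after the window is a couple
of commands from a bastion; "mytool" below is a placeholder tool name:

  become mytool        # switch to the tool account (placeholder)
  webservice restart   # restart the tool's web service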
The changeover will happen sometime between 14:00 and 15:00 UTC this
coming Thursday. Details about this can be found at
https://phabricator.wikimedia.org/T253124
-Andrew + the WMCS team
Hi there!
We just deployed tesseract-ocr v4.1.1 on the Toolforge grid.
The context for this update is Phabricator task T247422 [0].
Please report any issues you find; a quick sanity check is sketched below.
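For example, a hedged way to exercise the new version through the grid
scheduler (file names are placeholders; job output lands in your home
directory as <jobname>.out / <jobname>.err):

  jsub -N ocr-ver tesseract --version       # ~/ocr-ver.err should show 4.1.1
  jsub -N ocr-test tesseract scan.png scan  # writes scan.txt when the job ends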
regards!
[0] https://phabricator.wikimedia.org/T247422
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
*What is happening*: In preparation for work upstream on the production wiki databases, the Wiki Replicas service needs to drop some columns from the views used by Toolforge and Cloud VPS users.
The columns being dropped are:
* archive.ar_text_id
* archive.ar_content_model
* archive.ar_content_format
* revision.rev_text_id
* revision.rev_content_model
* revision.rev_content_format
NOTE: revision.rev_content_format and revision.rev_text_id are only relevant when loading serialized blobs from external storage, which is
not possible from the Wiki Replicas. These columns are being removed without replacement.
These columns currently contain stale data in both the replicas and the production databases. The actual data used in production was moved
entirely to the "slot" and "content" tables on 2019-11-18 (<https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/551551/>).
Information about migrating tools to the new schema is available in the description of this task: https://phabricator.wikimedia.org/T174047
Additional information about the overall project and changes can be found here:
https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema
The columns will continue to exist in the revision_compat and archive_compat views as a stop-gap to keep tools that rely on those fields from completely breaking while you work on updating to the new schema. These two views are expected to perform poorly because they include joins against the content and slots tables. Please use them only if you need them and will take longer to finish refactoring your code to the new schema.
*When is this happening*: The changes will take time to run across the replicas and databases, possibly over the course of a few days, beginning 2020-05-25. Servers will be depooled as needed to allow the changes to be applied.
*What should I do*: Between now and the 25th of May, stop using the fields we are removing. If you don't already, make sure you use the slots and content tables instead; a sketch of querying the new schema follows.
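As an illustration, a minimal sketch of fetching a revision's content
model through the new tables, assuming the `toolforge` Python helper
package on Toolforge; the page id is a placeholder:

  import toolforge  # Toolforge helper; provides replica connections

  conn = toolforge.connect('enwiki')  # any wiki replica works the same way
  with conn.cursor() as cur:
      # Resolve the content model via slots/content instead of the
      # dropped revision.rev_content_model column.
      cur.execute(
          """
          SELECT r.rev_id, cm.model_name
          FROM revision r
          JOIN slots s ON s.slot_revision_id = r.rev_id
          JOIN content c ON c.content_id = s.slot_content_id
          JOIN content_models cm ON cm.model_id = c.content_model
          WHERE r.rev_page = %s
          """,
          (12345,),  # placeholder page id
      )
      for rev_id, model_name in cur.fetchall():
          print(rev_id, model_name)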
Progress on this action will be tracked in this Phabricator task: https://phabricator.wikimedia.org/T252219.
--
Brooke Storm
SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_
Tomorrow I'm going to upgrade several of the OpenStack control nodes to
Debian Buster. Due to version incompatibilities, Buster and Stretch
nodes can't cooperate in the same cluster, so I will need to switch
service between clusters a couple of times.
If things go really well, this will cause only brief hiccups in the
OpenStack APIs. More likely, though, things will get a bit tangled up
and Horizon will misbehave for 20-30 minutes during the transition.
The first switchover will happen between 14:00 and 15:00 tomorrow; the
second switch will follow an hour or two later.
-Andrew
cloudvirt1004 is one of our oldest-generation hypervisor servers.
The hypervisor servers are the machines which actually run the virtual
machine instances for Cloud VPS projects. This physical host is
experiencing an active hard disk and/or RAID controller failure. The
Cloud Services team is actively attempting to fix the server and
evacuate all instances running on it to other hypervisors.
See <https://phabricator.wikimedia.org/T250869> for more information
and progress updates.
The following projects and instances are affected:
* cloudvirt-canary
** canary1004-01.cloudvirt-canary.eqiad.wmflabs
* commonsarchive
** commonsarchive-mwtest.commonsarchive.eqiad.wmflabs
* deployment-prep
** deployment-echostore01.deployment-prep.eqiad.wmflabs
** deployment-schema-2.deployment-prep.eqiad.wmflabs
* incubator
** incubator-mw.incubator.eqiad.wmflabs
* machine-vision
** visionoid.machine-vision.eqiad.wmflabs
* ogvjs-integration
** media-streaming.ogvjs-integration.eqiad.wmflabs
* services
** Esther-outreachy-intern.services.eqiad.wmflabs
* shiny-r
** discovery-testing-02.shiny-r.eqiad.wmflabs
* tools
** tools-k8s-worker-38.tools.eqiad.wmflabs
** tools-k8s-worker-52.tools.eqiad.wmflabs
** tools-sgeexec-0901.tools.eqiad.wmflabs
** tools-sgewebgrid-lighttpd-0918.tools.eqiad.wmflabs
** tools-sgewebgrid-lighttpd-0919.tools.eqiad.wmflabs
* toolsbeta
** toolsbeta-sgewebgrid-generic-0901.toolsbeta.eqiad.wmflabs
* wikidata-autodesc
** wikidata-autodesc.wikidata-autodesc.eqiad.wmflabs
* wikilink
** wikilink-prod.wikilink.eqiad.wmflabs
Bryan, on behalf of the Cloud VPS admins and Cloud Services team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808