Hi,
today, 2019-09-30, we performed an operation on all Cloud VPS virtual machines
to update ferm, a firewalling utility, to fix a bug [0].
The fleet-wide operation installed ferm on every VM, including those that did
not require it. This caused a network outage for most of the virtual machines
and projects that had not previously been configured to use ferm: many
Toolforge tools (webservices, grid jobs, etc.) stopped working, database
connections were lost, the web proxy reported bad gateway errors, and so on.
To resolve the issue, we quickly removed ferm from every VM and ran the puppet
agent to reinstall it only on the VMs that have ferm in their puppet manifests.
As soon as we did this, everything went back to normal.
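For reference, the per-VM remediation was roughly equivalent to the following
(a sketch; the actual rollout used our fleet-management tooling):

    # Remove the stray ferm package, then let puppet reinstall it
    # only where the VM's manifests actually require it
    sudo apt-get remove --purge -y ferm
    sudo puppet agent --test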
The incident lasted roughly one hour.
Please get in touch if you notice any lingering issues or have any questions
about this incident.
Regards.
[0] https://phabricator.wikimedia.org/T153468
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
Due to a mishap during routine data-center maintenance, one of our
hypervisors lost power a short while ago. Everything is back up and running
now, but some of you may have experienced a few minutes of downtime and an
unexpected reboot of your instance.
Toolforge was largely unaffected by this incident, other than some jobs
getting rescheduled. The VMs that were restarted are:
accounts-dbslave.account-creation-assistance.eqiad.wmflabs
af-netbox01.automation-framework.eqiad.wmflabs
arturo-k8s-test-2.openstack.eqiad.wmflabs
arturo-k8s-test-3.openstack.eqiad.wmflabs
arturo-k8s-test-4-2.openstack.eqiad.wmflabs
beryllium.rcm.eqiad.wmflabs
canary1027-01.testlabs.eqiad.wmflabs
captcha-imageprocessing-11.privpol-captcha.eqiad.wmflabs
clouddb-services-puppetmaster-01.clouddb-services.eqiad.wmflabs
deployment-acme-chief04.deployment-prep.eqiad.wmflabs
deployment-aqs01.deployment-prep.eqiad.wmflabs
deployment-aqs02.deployment-prep.eqiad.wmflabs
deployment-db06.deployment-prep.eqiad.wmflabs
deployment-prometheus02.deployment-prep.eqiad.wmflabs
gnd-02.orig.eqiad.wmflabs
jbond-buster.puppet.eqiad.wmflabs
krenair-t219424-b.testlabs.eqiad.wmflabs
lizenzhinweisgenerator-api-test.lizenzhinweisgenerator.eqiad.wmflabs
logstack03.security-tools.eqiad.wmflabs
mcr-sdc.mcr-dev.eqiad.wmflabs
ntp-02.cloudinfra.eqiad.wmflabs
paws-int-lb-02.paws.eqiad.wmflabs
paws-master-02.paws.eqiad.wmflabs
paws-packages-01.paws.eqiad.wmflabs
paws-proxy-02.paws.eqiad.wmflabs
paws-puppetmaster-01.paws.eqiad.wmflabs
paws-worker-01.paws.eqiad.wmflabs
proxy-01.project-proxy.eqiad.wmflabs
redirects-nginx01.redirects.eqiad.wmflabs
sentry-builder.sentry.eqiad.wmflabs
toolsbeta-docker-registry-01.toolsbeta.eqiad.wmflabs
wikibase-stretch.wikidata-dev.eqiad.wmflabs
wpx-mediawiki-02.wpx.eqiad.wmflabs
On June 30, 2020, the Debian project will stop providing security patch
support for the Debian 8 "Jessie" release. The Cloud Services and SRE
teams at the Wikimedia Foundation would like to see all use of Debian
Jessie in our managed networks replaced with newer versions of the Debian
operating system on, or ideally well before, that date.
A page has been created on Wikitech [0] with an initial timeline for
the removal of all Debian Jessie instances from Cloud VPS projects.
This timeline follows roughly the same schedule as we used in 2018
when deprecating Ubuntu Trusty in Cloud VPS projects:
* September 2019: Announce the initiative via this email and the Wikitech page
* October 2019: Start actively contacting instance maintainers who
need to migrate to a new OS
* November & December 2019: Continue to work with instance maintainers
to migrate to a new OS
* January 2020: Shut down remaining Debian Jessie instances
If you know that your Cloud VPS project is using Debian Jessie, you
can get a head start on migrating your instances to Debian Buster
(preferred) or Stretch by visiting the Wikitech page and reading the
instructions there.
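If you are unsure which Debian release an instance is running, you can check
from a shell on the instance itself, for example:

    # Prints the release codename: "jessie", "stretch", or "buster"
    lsb_release -sc

    # Or inspect the release metadata directly
    cat /etc/os-release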
If you are a concerned Toolforge user, stay tuned for future
announcements about changes that will be made as the Toolforge admin
team works to remove Debian Jessie from that environment. For now
there is nothing an individual Tool maintainer needs to do.
[0]: https://wikitech.wikimedia.org/wiki/News/Jessie_deprecation
Bryan - on behalf of the Cloud VPS admin team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
We need to do a proper incident report, but I wanted to send out a
(late) notice that the Toolforge Kubernetes cluster was at best
degraded and at worst completely broken from 2019-09-10T18:54 to
2019-09-11T01:30.
The TL;DR is that some change, likely part of T171188 ("Move the main
WMCS puppetmaster into the Labs realm"), tricked Puppet into installing
an old version of the x509 signing cert used to secure communication
between the etcd cluster and kube-apiserver. This manifested as an
alert from our monitoring system that the Kubernetes API was broken.
When investigating that alert, we found that the kube-apiserver was
unable to connect to its paired etcd cluster. The etcd cluster seemed
to be flapping internally (status showing good, then failed, then good
again). Diagnosing the cause of this flapping resulted in a complete
failure of the etcd cluster. Restoring the etcd cluster was a long and
difficult task. Once etcd was recovered, it took about 1.5 more hours
to find the cause and fix for the initial communication errors (the
wrong x509 signing certificate). It is currently unclear if the x509
misconfiguration also caused the etcd cluster failure, or if that was
an unrelated and unfortunate coincidence.
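For those curious, the debugging involved checks along these lines (hostnames
and file paths here are illustrative, and exact etcdctl flags vary by etcd
version):

    # Ask etcd about cluster membership and health (v2-era tooling)
    etcdctl --endpoints https://etcd-host.example:2379 cluster-health

    # Inspect the issuer, subject, and validity window of the client
    # certificate securing the kube-apiserver <-> etcd connection
    openssl x509 -in /path/to/etcd-client.crt -noout -issuer -subject -dates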
See https://phabricator.wikimedia.org/T232536 for follow up
documentation (when we write it during the coming US business day).
Bryan - on behalf of the Toolforge admin team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
Later today (starting in a few hours, around 18:00 UTC) we'll be
rearranging the puppetmaster setup for most Cloud VPS VMs [0]. No tools or
services (other than puppet) should be affected, but some of you might
get grumpy emails about broken puppet runs during the transition; feel
free to ignore those. If you're planning to update the puppet
configuration of your VMs, please postpone that work until after our
migration.
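If you want to confirm that puppet is healthy on a VM once the migration is
done, a manual run should complete without errors:

    # Trigger a one-off puppet run and watch its output
    sudo puppet agent --test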
[0] full context at https://phabricator.wikimedia.org/T171188
The DNS recursor servers, which are used from inside Cloud VPS and
Toolforge to resolve both internal and external hostnames to IP
addresses, were not functional from approximately 2019-09-09T00:51 UTC to
2019-09-09T01:35 UTC. During this time, most (if not all) DNS lookups
would have returned a "SERVFAIL" response. The issue appears to be
resolved now.
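For illustration, a lookup from inside Cloud VPS during the outage would have
failed along these lines (output abbreviated):

    $ dig wikipedia.org
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: ...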
We will share more information about what happened and how the problem
was corrected when we are sure that doing so will not cause additional
issues.
Bryan, on behalf of the Cloud VPS admin team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
(Corrected the date in the subject line from the previous notification.)
Next Tuesday, September 3rd, between 13:00 and 14:00 UTC, we'll be
performing backend database maintenance on the OpenStack control plane for
Cloud VPS. During this maintenance window the Horizon web dashboard will be
unavailable, and all requests to create, modify, or delete VPS resources
such as virtual machines and DNS entries will be blocked.
Existing VPS virtual machines will remain running and Toolforge users will
not be affected by this maintenance.
--
Wikimedia Cloud Services
Today I rebuilt the Docker images that are used by the `webservice
--backend=kubernetes` command. This is a normal thing that we do
periodically in Toolforge to ensure that security patches are applied in
the containers. This round of updates was a bit different, however, in
that it is the first time the Debian Jessie-based images have been
rebuilt since the upstream Debian project removed the 'jessie-backports'
apt repo.
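If you want a running webservice to pick up a rebuilt image right away,
restarting it should be enough (a sketch, assuming your tool uses the
Kubernetes backend):

    # Run from a Toolforge bastion as your tool account
    webservice --backend=kubernetes restart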
Everything should be fine, but if you see weirdness when restarting a
webservice or other Kubernetes pod that looks like it could be related
to software in the Docker image, please let me or one of the Toolforge
admins know, either by filing a Phabricator bug report or, for a faster
response, by joining the #wikimedia-cloud IRC channel on Freenode and
sending a "!help ...." message to the channel explaining your issue.
Bryan - on behalf of the Toolforge admins
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808