Hi Everyone,
Over the last few months, the Wikimedia Developer Advocacy team has been
working to improve technical documentation for the MediaWiki Action API
<https://www.mediawiki.org/wiki/API:Main_page>.
So far, we have:
- Started efforts to revise, simplify, and reorganize the MediaWiki
Action API pages on MediaWiki using a new documentation template for
sub-pages: https://www.mediawiki.org/wiki/API:Documentation_template
- Updated the API navigation-template:
https://www.mediawiki.org/wiki/Template:API
As we continue to make improvements to the technical documentation, we
could use your help to better guide our efforts!
Would you please take a few moments to complete the following survey and
share your opinions and experiences with us?
https://goo.gl/forms/Y5PGILb6b3awC3OJ2
*Notes about the MediaWiki Action API Survey:*
*Survey Period:* December 6, 2018 - January 6, 2019
*Privacy Policy:* This survey will be conducted via a third-party service,
which may subject it to additional terms. For more information on privacy
and data-handling, see the survey privacy statement:
https://foundation.wikimedia.org/wiki/MediaWiki_Action_API_Survey_Privacy_S…
Thanks for your participation!
Kindly,
Sarah R. Rodlund
Technical Writer, Developer Advocacy
<https://meta.wikimedia.org/wiki/Developer_Advocacy>
srodlund(a)wikimedia.org
Tomorrow I'll be moving the grid engine master node to a new virt host.
That will cause a 15-minute outage during which new jobs (crons, or
things submitted by hand) will fail.
Existing jobs or webservices will be unaffected by the downtime.
I'll start the move at 16:00 UTC on Friday, 2018-12-21. That's 8AM in
California.
-Andrew
Hi!
Tomorrow 2018-12-20 @ 17:00 UTC (~24h from now) we will be conducting
some network maintenance in Cloud VPS (openstack).
We will be doing some work on the transport network that connects the
Neutron server to the rest of the internet. Running CloudVPS instances
will see a brief connection interruption to any external service
(outside CloudVPS).
If everything goes as planned (and our tests suggest it will), all
operations will be finished in just a couple of minutes.
Please let us know about any issues you find. Thanks.
Hello,
Today we have disabled BigBrother in Toolforge. BigBrother was a tool
that monitored continuous jobs which failed to get restarted because
they ran into corner cases where Grid Engine wasn't smart enough to
restart them (e.g. running out of memory). BigBrother would
continuously monitor those jobs and duplicate the restart
functionality in a layer above Grid Engine.
Although very few tools used BigBrother (0.65%, to be precise), it
taxed our NFS file server constantly, so keeping it around didn't make
much sense. Additionally, its functionality can easily be implemented
with a shell script run from cron.
So we've converted all tools that had a .bigbrotherrc file to use a
bigbrother.sh script that is triggered every 5 minutes to restart jobs.
If your tool used BigBrother, please check your crontab (`crontab -l`);
you will see a few entries like this:
```
# Ensure continuous jobs are running
*/5 * * * * jlocal /data/project/tool_name/bigbrother.sh job_name job_script
```
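For reference, here is a minimal sketch of what such a cron-driven restart script could look like. This is an illustration only, not the exact script deployed to each tool; the `qstat`/`jsub` invocations are assumptions about the usual grid engine client tools:

```
#!/bin/bash
# bigbrother.sh -- illustrative sketch of a cron-driven job restarter
# (not necessarily the exact script deployed to each tool).
# Usage: bigbrother.sh <job_name> <job_script>

ensure_running() {
    local job_name="$1" job_script="$2"
    # qstat -j <name> exits non-zero when the grid engine knows no job
    # by that name, i.e. the continuous job has died.
    if ! qstat -j "$job_name" >/dev/null 2>&1; then
        jsub -N "$job_name" -continuous "$job_script"
    fi
}

if [ "$#" -eq 2 ]; then
    ensure_running "$1" "$2"
fi
```

The crontab entry shown above then simply invokes this script every five minutes with the tool's job name and start script.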
Documentation has also been updated to reflect this change:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Bigbrother_(Depreca…
In our tests everything worked fine, but please let us know if your
tool is impacted by this change.
Regards,
--
Giovanni Tirloni
Operations Engineer
Wikimedia Cloud Services
On Monday, December 3rd, 2018 at 17:00 UTC, we will be rebooting one of the two dumps NFS servers (labstore1006.wikimedia.org). This may briefly cause elevated load, but should be quick enough that failing over services is unlikely to be helpful. We will be failing over the web service before that time and failing it back before rebooting the partner server (labstore1007.wikimedia.org) on Friday, December 7th at 17:00 UTC. This should not interrupt service to dumps.wikimedia.org (the site hosted on these systems), since it will be failed over to the non-rebooting partner.
Brooke Storm
Operations Engineer
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_
I recently noticed that some of our standard kvm/nova monitoring never
got copied over from the labvirt puppet code to the cloudvirt puppet
code. Tomorrow I will merge
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/478113/ to fix that.
Once that patch is merged, icinga will be a bit touchier on the
cloudvirts. In particular, it will alert for any cloudvirt that has 0
VMs running on it. (This turns out to be a useful thing to watch for
because we've had cases where every single kvm process died at once.)
So, all 'idle' cloudvirts should nonetheless have a canary instance.
For example, on the new analytics cloudvirts I created canaries like this:
```
$ OS_PROJECT_ID=testlabs openstack server create \
    --image 7c6371d1-8411-48c7-bf73-2ef6d6ff2a15 \
    --flavor m1.small \
    --nic net-id=7425e328-560c-4f00-8e99-706f3fb90bb4 \
    --availability-zone host:cloudvirtan1004 \
    canary-an1004-01
```
Once a virt host is in full service we can leave the canaries there or
delete them -- there hasn't been any real consistent policy there.
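Counting a host's VMs is straightforward with the standard openstack CLI. As an illustrative sketch of the kind of zero-VM check described above (the script name and Icinga-style exit codes here are hypothetical, not taken from the patch):

```
#!/bin/bash
# check_cloudvirt_vms.sh -- illustrative sketch of a "0 VMs running"
# check; the real check comes from the puppet patch and may differ.
# Usage: check_cloudvirt_vms.sh <hypervisor>

count_vms() {
    local host="$1"
    # List every project's servers scheduled on this hypervisor, one ID
    # per line, and count them. grep exits non-zero on zero matches, so
    # swallow that to keep the count ("0") as the only result.
    openstack server list --all-projects --host "$host" -f value -c ID \
        | grep -c . || true
}

check_host() {
    local host="$1" n
    n=$(count_vms "$host")
    if [ "$n" -eq 0 ]; then
        # A canary instance should keep this from ever firing on a
        # healthy, "idle" host.
        echo "CRITICAL: $host is running 0 VMs"
        return 2
    fi
    echo "OK: $host is running $n VMs"
}

if [ "$#" -eq 1 ]; then
    check_host "$1"
fi
```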
In related news, I'm attempting to silence cloudvirt1019 and 1020
altogether with
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/478115/ because
we reboot them twice a day and a reboot always kills any running VMs.
-Andrew
With any luck we'll have some more hardware installed by next week, so
it's time to move more projects! This is probably the last round of
bulk moves; after this it's all special cases for which I'll contact
people directly.
Tuesday, 2018-12-11: maps, wm-bot
Wednesday, 2018-12-12: mwoffliner, wildcat
Thursday, 2018-12-13: snuggle, services, commonsarchive, wikitextexp
Friday, 2018-12-14: queryrapi, wikidumpparse, wikistats, butterfly
Monday 2018-12-17: huggle, incubator, iiab, openrefine, wcdo,
wikidataconcepts
Tuesday 2018-12-18: wikimetrics, newsletter, telnet, signwriting,
ogvjs-ingetration
Wednesday 2018-12-19: multimedia, orig, security-tools, phragile,
wikistream, otrs, yandex-proxy
Thursday 2018-12-20: dashiki, etytree, partnermetrics, graphql
Some context for what this is all about can be found here:
https://phabricator.wikimedia.org/phame/post/view/120/neutron_is_here/
Please let me know if you are involved in one of those projects and need
to postpone the move or schedule a to-the-minute migration window.
- Andrew + the WMCS team
ToolsDB will be undergoing maintenance and updates on Tuesday, November 27th, from 17:30 UTC to 18:00 UTC.
Actual outage time should be fairly brief, but during this window the database will be taken offline and the system rebooted. Due to the expected brief nature of the outage and the fact that some tables are not replicated (see https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#ToolsDB_Backups…), we are not planning to fail over to the replica at this time.
Brooke Storm
Operations Engineer
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_
Hi,
next Tuesday, 2018-11-27 @ 17:30 UTC, we will reboot the
labnet1001.eqiad.wmnet server for maintenance and security updates.
This server provides virtual networking services for CloudVPS in the
main deployment (the old one, different from the eqiad1 deployment).
We won't be doing any failover prior to the reboot, for operational
reasons (we measured that failover downtime is longer than the actual
reboot time).
The impact of this brief reboot downtime will be:
* all VMs in the main CloudVPS deployment won't have network connectivity
* ongoing network connections (downloads, uploads) will fail and will
have to be restarted
* cross connectivity between VM instances in the main and eqiad1
deployments won't be possible
Thanks for your understanding, and let us know any issues you may find
after the reboot next week.
Hi,
next Tuesday 2018-11-20 at 17:30 UTC we will be rebooting the OSM
database (part of our data services) for maintenance and security updates.
Specifically, the labstore1006.eqiad.wmnet (osmdb.eqiad.wmnet) server
will be rebooted. The other server in the cluster,
labstore1007.eqiad.wmnet, has already been rebooted, but we won't be
doing any pre-failover, for operational reasons.
Apologies in advance for any inconvenience, and please let us know
about any issues you find after these operations.