TechOps update - Wikitech-l

11 Jan 2018


Hi everyone.
We are making some exciting changes in TechOps!
The Technical Operations team in the Technology department is possibly the oldest team in the organization. Originating from a group of volunteers (Mark being one of them) that enjoyed building and maintaining this up-and-coming, soon to become global top-10 web site as a hobby, the team has always focused on the challenge of keeping Wikimedia’s sites, services, and infrastructure working as well as possible. They did this at first on what can only be described as a shoestring budget, and still with modest resources today (more on this later).
Over time the team has grown to a professional staff of currently 18, with a pretty flat structure. Apart from two sub teams (Traffic and Data Center Ops) that do have a clearly defined scope, most of the team’s members as well as the majority of TechOps’s responsibilities still reside in the “Core Ops” sub team.
To strengthen the team as it continues to grow in responsibilities and membership we’ve decided to make some changes to the team’s structure, its leadership and its public profile.
Starting with the last of these, we've decided to rename the team from Technical Operations to Site Reliability Engineering (SRE). SRE is a relatively modern term that more accurately describes the type of work the Technical Operations team has, to some extent, been doing for the past few years, as well as the direction in which it needs to grow. Coined by Ben Treynor of Google, it’s now widely used across the industry. SRE describes a discipline where the emphasis is on the software engineering aspects of the work, with a focus on tools development and automation rather than human labor. Our hope is that this name change will more accurately represent the work and will help with recruiting into the team.
Second, we will increase the team’s management capacity. As the responsibilities and management/coordination/planning needs of the team kept growing, Faidon has stepped up and increased his involvement significantly. For example, he covered for Mark during his paternity leave, and he has played a key leadership role in our efforts in the lawsuit against the NSA. In my time at the Foundation, I have come to rely on Faidon’s judgement, his ability to execute, and most of all on his leadership. So in recognition of Faidon’s important leadership role and responsibilities in the team, he is promoted to Director of Site Reliability Engineering. Well done Faidon!
Mark and Faidon both will now be “Director of Site Reliability Engineering”, reporting to me. They will share some of the responsibilities of the team, such as its roadmap, and CapEx and OpEx planning and execution, as they have been doing for some time now. Each will lead one of two new sub-groups, “Service Operations” led by Mark, and “Infrastructure Foundations”, led by Faidon. The team will continue to operate as a single group responsible for the organization’s broader Site Reliability Engineering function, with both Mark and Faidon as leaders of the respective groups.
I also want to offer a few words about Mark. Mark exemplifies our values and we wouldn’t be the same without him. From driving servers around in the trunk of his car in the earliest days of the projects to building and running an exemplary team that has consistently delivered 99.98% uptime for the world’s fifth-most popular website, his work has been nothing short of heroic. He has done this with a team of 18 people, which many in our industry find incomprehensible.
Both Katherine and our Board have recognized that delivering this level of performance with our radically efficient team is not sustainable as we continue to grow and make steps towards our strategic directions of knowledge equity and knowledge as a service. Katherine has asked, and the Board has unanimously recommended, that we step up our investment in the team. I am thrilled at their support, which will enable our SRE team to have access to additional resources within the current fiscal year.
Last but not least, and in an effort to return to his earlier days in the projects (and, in his words, an attempt to gain back some respect from his technical colleagues :-), Mark will dedicate two days a week to individual technical contributions in addition to his managerial work. Mark, thank you for your remarkable contributions!
Finally, I wanted to share more detail on our new sub team structure and scope.
Data Center Operations
The existing Data Center Operations sub team continues as-is but will now be managed by Faidon. The team, consisting of Rob, Chris, and Papaul, is responsible for all of Wikimedia’s data center deployments and logistics as well as maintaining our presence in 8 locations across the world. They perform on-site work and maintain the full 5-year life cycle (specs, purchasing, physical install, break/fix and decommissioning) for all hardware.
Infrastructure Foundations
This new sub team will focus on building and maintaining our base platform (“metal cloud”), the foundation upon which nearly everything else in our infrastructure builds. On top of our bare metal deployments, their responsibilities include (but are not limited to) configuration management systems, infrastructure automation, orchestration tooling, logging, metrics and monitoring as well as infrastructure security. This team consists of Riccardo, Filippo, Keith and Moritz, who will report to Faidon.
Traffic
The current Traffic sub team remains unchanged in membership, scope, and management. They are responsible for the critical first layer of high-traffic infrastructure which now spans much of the globe, including our TLS termination and caching layers, load balancing, DNS and our own network. The members of this team are Brandon, Emanuele and Arzhel as well as Valentin Gutierrez, our newly hired Traffic Security Engineer who will be starting on February 12th. They report to the team’s technical lead and manager, Brandon, who in turn will continue to report to Mark.
Data Persistence
The new Data Persistence sub team will focus on Wikimedia’s persistent data storage and retrieval systems, including (No)SQL databases, (distributed) object storage, file storage and backup systems. Today, this team will start with just our two database administrators, Jaime and Manuel, but the expectation is that this team will be built out in the near future with additional hands and expertise. They will report to Mark.
Service Operations
Finally, the Service Operations sub team will take care of public and “user-visible” services alongside Technology and Audiences teams. This includes, for example, our big MediaWiki platform, but also the newer (micro)services that comprise our stack. It also includes miscellaneous services and components that we rely upon (think Phabricator, mail systems, OTRS, etc.). The team will continue building our new SOA service infrastructure based on Kubernetes. Its membership will consist of Alexandros, Giuseppe, Ariel and Daniel, reporting to Mark.
Please welcome our new SRE team!
Victoria (with a lot of help from Mark, Faidon and the SRE team)