Hey all,
Just a question -- do folks know if incident reports are still being completed & published on Wikitech (such that they're listed at [0])?
I just thought to ask given that - for me - the Phabricator search results at [1] list 9 tasks created since the start of this year that are tagged with `#Wikimedia-Incident` (& some folks with an NDA may be able to see more tasks than that), but the Wikitech page at [0] currently only lists there as having been one incident that's occurred since the start of this year [2].
[0] https://wikitech.wikimedia.org/wiki/Incident_status [1] https://phabricator.wikimedia.org/maniphest/?projects=wikimedia-incident&... [2] https://wikitech.wikimedia.org/wiki/Incidents/2026-02-23_ml-serve
Best, -- a smart kitten
Hello,
On Sun, 26 Apr 2026 at 21:11, a smart kitten via Wikitech-l < wikitech-l@lists.wikimedia.org> wrote:
Hey all,
Just a question -- do folks know if incident reports are still being completed & published on Wikitech (such that they're listed at [0])?
I just thought to ask given that - for me - the Phabricator search results at [1] list 9 tasks created since the start of this year that are tagged with `#Wikimedia-Incident` (& some folks with an NDA may be able to see more tasks than that), but the Wikitech page at [0] currently only lists there as having been one incident that's occurred since the start of this year [2].
The short answer is no, they are not being published regularly any more - there have been a few shifts in incident response that we need to document and align our processes on.
There are a few factors at work here:
- One of the main aspects is that the majority of infrastructure-related incidents that we’ve been dealing with over the last few months have been of a sensitive nature. As per WP:DNFTT https://en.wikipedia.org/wiki/Wikipedia:Deny_recognition and our policies around sensitive information, we don’t publish information on our response to these outages even if some of the action items are fine to be public.
- Another factor is the need to scale our incident tracking process. We now coordinate all incidents using Corto https://wikitech.wikimedia.org/wiki/Corto, which creates Phabricator tickets automatically - defaulting to closed.
- The last factor is our incident review process. To account for the diversity of incidents and involvement of many teams, we have moved to a less rigid process where the filling out of the wikitech document and the scorecard etc don’t feature. This is the main reason for the lack of updates to the incident status page.
Filling out the wikitech incident reports has been a fairly arduous process of copying information out of Phabricator tasks and Google docs, which duplicates effort. For the time being, given the more concerted use of Phabricator tasks in the era of Corto, we will try to update and publish Phabricator tickets wherever possible. When an incident is being closed out by the incident coordinator, they will transfer relevant information from the Google doc (if needed) and open up the task if it is suitable to do so, in order to provide context on our outages and any interesting technical lessons learned. We’ll also work towards formalising a process to officially replace the existing one.
Cheers, Hugh
One of the main aspects is that the majority of infrastructure-related incidents that we’ve been dealing with over the last few months have been of a sensitive nature. As per WP:DNFTT https://en.wikipedia.org/wiki/Wikipedia:Deny_recognition and our policies around sensitive information, we don’t publish information on our response to these outages even if some of the action items are fine to be public.
Another factor is the need to scale our incident tracking process. We now coordinate all incidents using Corto https://wikitech.wikimedia.org/wiki/Corto, which creates Phabricator tickets automatically - defaulting to closed.
From looking at the #wikimedia-incidents https://phabricator.wikimedia.org/project/board/2143/query/all/ dashboard, even with volunteer NDA access, only 3 tasks remain non-public; the "rest" are public. This means either that the vast majority of such tasks are being filed in a way that gives even community members with appropriate access very little visibility into the state of incidents on Wikimedia services (which, to remind you, previously used to be an open on-wiki process), or that y'all really aren't having that many cases where DENY is being followed. I cannot make up my mind which is the correct answer in this instance.
The last factor is our incident review process. To account for the diversity of incidents and involvement of many teams, we have moved to a less rigid process where the filling out of the wikitech document and the scorecard etc don’t feature. This is the main reason for the lack of updates to the incident status page.
The process of filling out the wikitech incident reports has been a fairly arduous process of copying information out of Phabricator tasks and Google docs which duplicates effort.
If the main thrust is "we would like our newer crop of engineers/managers to not use wikitext and instead use an IRC bot" (which, imo, is definitely not a great take to have given that this is Wikimedia), why not just code the bot to transparently automatically copy everything to a wikitext page (similar to SAL)?
The reason I say this is that there currently appears to be no easy way to look at incident reports chronologically, per year, in a nice and organized manner, and the current approach scales poorly if I want to answer a question like, "So, the day before yesterday, somebody on VPT saw a ton of broken thumbnails; was that an incident or a me problem?"
Regards, Sohom Datta --- Open-source contributor @Wikimedia
On Tue, May 12, 2026 at 7:29 AM Hugh Nowlan via Wikitech-l < wikitech-l@lists.wikimedia.org> wrote:
Hello,
The short answer is no, they are not being published regularly any more - there have been a few shifts in incident response that we need to document and align our processes on.
There are a few factors at work here:
- One of the main aspects is that the majority of infrastructure-related incidents that we’ve been dealing with over the last few months have been of a sensitive nature. As per WP:DNFTT https://en.wikipedia.org/wiki/Wikipedia:Deny_recognition and our policies around sensitive information, we don’t publish information on our response to these outages even if some of the action items are fine to be public.
- Another factor is the need to scale our incident tracking process. We now coordinate all incidents using Corto https://wikitech.wikimedia.org/wiki/Corto, which creates Phabricator tickets automatically - defaulting to closed.
- The last factor is our incident review process. To account for the diversity of incidents and involvement of many teams, we have moved to a less rigid process where the filling out of the wikitech document and the scorecard etc don’t feature. This is the main reason for the lack of updates to the incident status page.
The process of filling out the wikitech incident reports has been a fairly arduous process of copying information out of Phabricator tasks and Google docs which duplicates effort. For the time being, given the more concerted use of Phabricator tasks in the era of Corto, we will try to update and publish Phabricator tickets wherever possible. When an incident is being closed out by the incident coordinator, they will transfer relevant information from the Google doc (if needed) and open up the task if it is suitable to do so in order to provide context on our outages and any interesting technical lessons learned. We’ll also work towards formalising a process to replace the existing process officially.
Cheers, Hugh
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
I've filed T426137 https://phabricator.wikimedia.org/T426137 regarding the issues I outlined in my previous email.
Regards, Sohom Datta --- Open-source contributor @Wikimedia
On Tue, May 12, 2026, 9:19 AM Sohom Datta dattasohom1@gmail.com wrote:
One of the main aspects is that the majority of infrastructure-related incidents that we’ve been dealing with over the last few months have been of a sensitive nature. As per WP:DNFTT https://en.wikipedia.org/wiki/Wikipedia:Deny_recognition and our policies around sensitive information, we don’t publish information on our response to these outages even if some of the action items are fine to be public.
Another factor is the need to scale our incident tracking process. We now coordinate all incidents using Corto https://wikitech.wikimedia.org/wiki/Corto, which creates Phabricator tickets automatically - defaulting to closed.
From looking at the #wikimedia-incidents https://phabricator.wikimedia.org/project/board/2143/query/all/ dashboard, even with volunteer NDA access, only 3 tasks remain non-public; the "rest" are public. This means either the vast majority of such tasks are being filed in a way that even community members with appropriate access have very little visibility into the state of incidents on Wikimedia services (which, to remind you, previously used to be an open on-wiki process), or y'all really aren't having that many cases where DENY is being followed and I cannot make up my mind which in this instance is the correct answer.
The last factor is our incident review process. To account for the diversity of incidents and involvement of many teams, we have moved to a less rigid process where the filling out of the wikitech document and the scorecard etc don’t feature. This is the main reason for the lack of updates to the incident status page.
The process of filling out the wikitech incident reports has been a fairly arduous process of copying information out of Phabricator tasks and Google docs which duplicates effort.
If the main thrust is "we would like our newer crop of engineers/managers to not use wikitext and instead use an IRC bot" (which, imo, is definitely not a great take to have given that this is Wikimedia), why not just code the bot to transparently automatically copy everything to a wikitext page (similar to SAL)?
The reason I say this is because there appears to be no current easy way to chronologically look at incident reports per year in a nice and organized manner, and the current approach being used is very hard to scale if I want to query the question, "So, the day before yesterday, somebody on VPT saw a ton of broken thumbnails, was that an incident or a me problem ?"
Regards, Sohom Datta
Open-source contributor @Wikimedia