On 27/11/12 17:49, David Gerard wrote:
On 27 November 2012 16:39, Andre Klapper <aklapper@wikimedia.org> wrote:
I propose adding a *new* priority called "Immediate" which should only be used to mark really urgent stuff to fix. This priority would be added above the existing "Highest" priority.
Has anyone suggested a separate "urgency" parameter?
Usually the priority is the result of both severity and urgency and is determined using a matrix.
Severity would be how much it impacts our mission, for example how many users are facing the issue. Gerrit being down is less severe than having a US datacenter on fire (literally).
Urgency is how fast you are required to fix the incident. This is usually tied to the service level agreement (SLA) contracted with the users. Providing the database dumps would probably have an SLA of one week, whereas serving the wikis' main pages is probably a 1 minute SLA.
Whenever an incident is triaged, people evaluate the severity (impact) and urgency depending on the context of the incident. Given a very simple system with only two levels (low / high), here are four examples representing all possibilities:
Change a MediaWiki setting for the Klingon Wikipedia:
  urgency: low (the change requests that links be green)
  severity: low (there are only a few readers for that wiki)

A database is almost out of disk space:
  urgency: high (whenever it fills up we will have a major outage)
  severity: low (nobody is affected yet)

People get disconnected from their sessions and need to log in again:
  urgency: low (that is annoying but users can still read content)
  severity: high (a lot of people are facing the issue)

Wiki serving blank pages:
  urgency: high (our mission to share knowledge is no longer fulfilled)
  severity: high (nobody can read content)
So now how do you determine the priority? Just assign increasing scores to each level and sum the urgency and severity scores: the incident with the highest score deserves all your attention.
Given low receives 0 points and high receives 1 point, the priorities would be something like:
0 : low - respond under 1 week, resolution target is one month
1 : high - respond in 1 day, resolution target is one week
2 : critical - respond immediately, resolution target is one hour
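
The scoring above can be sketched in a few lines of Python. This is only an illustration of the idea, not something Bugzilla provides: the names `Level`, `PRIORITIES` and `priority` are made up for the example, and the table simply copies the three priorities listed above.

```python
from enum import IntEnum


class Level(IntEnum):
    """Two-level scale used for both urgency and severity (hypothetical)."""
    LOW = 0
    HIGH = 1


# Summed score -> (priority name, response target, resolution target).
# Values are copied from the list above; the structure is an assumption.
PRIORITIES = {
    0: ("low", "under 1 week", "one month"),
    1: ("high", "1 day", "one week"),
    2: ("critical", "immediately", "one hour"),
}


def priority(urgency: Level, severity: Level) -> str:
    """Return the priority name for the sum of the two scores."""
    return PRIORITIES[urgency + severity][0]


# The four examples from this mail as (urgency, severity) pairs.
EXAMPLES = {
    "Klingon Wikipedia setting": (Level.LOW, Level.LOW),
    "Database almost out of disk space": (Level.HIGH, Level.LOW),
    "Users disconnected from sessions": (Level.LOW, Level.HIGH),
    "Wiki serving blank pages": (Level.HIGH, Level.HIGH),
}

if __name__ == "__main__":
    for name, (urgency, severity) in EXAMPLES.items():
        print(f"{name}: {priority(urgency, severity)}")
```

With only two levels, the two mixed cases (high urgency with low severity, and the reverse) end up with the same priority. A real matrix would probably use more levels, or weight one axis more than the other.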
Then you could add a specific status known as a major incident, with a very specific process attached to it. This is usually needed because you want C-level management to be made aware of it, have to gather people from different teams in the same place, need to assign someone to manage everyone, and finally need someone in charge of communicating with the users.
Building an incident process is definitely a tough task, but we are lucky to have smart people in our teams and a lot of literature available from people who had to implement such a process before us :-)