On 27/11/12 17:49, David Gerard wrote:
On 27 November 2012 16:39, Andre Klapper
<aklapper(a)wikimedia.org> wrote:
I propose adding a *new* priority called "Immediate" which should only
be used to mark really urgent stuff to fix. This priority would be added
above the existing "Highest" priority.
Has anyone suggested a separate "urgency" parameter?
Usually the priority is the result of both severity and urgency and is
determined using a matrix.
Severity would be how much it impacts our mission, for example how many
users are facing the issue. Gerrit being down is less severe than
having a US datacenter on fire (literally).
Urgency is how fast you are required to fix the incident. It is
usually tied to the service level agreement (SLA) contracted with the
users. Providing the database dumps would probably have an SLA of one
week, whereas serving the wikis' main pages probably has a 1 minute SLA.
Whenever an incident is triaged, people evaluate the impact and urgency
depending on the context of the incident. Given a very simple system
with only two levels (low / high) here are four examples representing
all possibilities:
Change a MediaWiki setting for the Klingon Wikipedia:
urgency : low (the change requests that the links be green)
severity : low (there are only a few readers for that wiki)
A database is almost out of disk space:
urgency : high (whenever it is filled up we will have a major outage)
severity : low (nobody is affected yet)
People get disconnected from their sessions and need to log in again:
urgency : low (that is annoying but users can still read content)
severity: high (a lot of people are facing the issue)
Wiki serving blank pages:
urgency : high (our mission to share knowledge is no longer fulfilled)
severity: high (nobody can read content)
So now how do you determine the priority? Just assign increasing scores
to each level and sum them; the highest score deserves all your attention.
Given low receives 0 points and high receives 1 point, the priorities
would be something like:
0 : low - respond within 1 week, resolution target is a month
1 : high - respond within 1 day, resolution target is a week
2 : critical - respond immediately, resolution target is an hour
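To make the scoring concrete, here is a minimal sketch of the two-level matrix in Python. The names Level, PRIORITIES and priority are made up for illustration; they are not part of Bugzilla or any existing tool, and the targets are simply the ones listed above.

```python
from enum import IntEnum


class Level(IntEnum):
    """Two-level scale used for both urgency and severity."""
    LOW = 0
    HIGH = 1


# Summed score -> (priority, response target, resolution target).
PRIORITIES = {
    0: ("low", "within 1 week", "a month"),
    1: ("high", "within 1 day", "a week"),
    2: ("critical", "immediately", "an hour"),
}


def priority(urgency: Level, severity: Level) -> str:
    """Return the priority name for an (urgency, severity) pair."""
    score = int(urgency) + int(severity)
    return PRIORITIES[score][0]


# The four example incidents as (urgency, severity) pairs.
EXAMPLES = {
    "Klingon wiki setting change": (Level.LOW, Level.LOW),
    "Database almost out of disk space": (Level.HIGH, Level.LOW),
    "Users disconnected from sessions": (Level.LOW, Level.HIGH),
    "Wiki serving blank pages": (Level.HIGH, Level.HIGH),
}

if __name__ == "__main__":
    for incident, (urgency, severity) in EXAMPLES.items():
        print(f"{incident}: {priority(urgency, severity)}")
```

With only two levels the two mixed cases land on the same priority; a real matrix would likely use more levels, and possibly weights, to tell them apart.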
Then you could add a specific status known as a major incident that
would have a very specific process attached to it, usually because you
want C-level management to be made aware of it, have to gather people from
different teams in the same place, assign someone to manage everyone, and
finally have someone in charge of communicating with the users.
Building an incident process is definitely a tough task, but we are
lucky to have smart people in our teams and a lot of literature
available from people who had to implement such processes before us :-)
--
Antoine "hashar" Musso