On Tue, Sep 15, 2020 at 11:06 AM Brennen Bearnes bbearnes@wikimedia.org wrote:
On 9/15/20 9:43 AM, Alex Ezell wrote:
Do we use levels for any of these error log outputs? That is, are they classified on output as High, Medium, Low, Info, or something like that?
Teasing out more detail about reported error severity could be a useful exercise, but I'm not sure it would result in much more meaningful signals than we currently have about production health. Serious problems can manifest as trivial-seeming notices, some issues start out that way and cascade over time, and generally any form of recurring logspam needs human evaluation before we can easily say much more than "this is a problem".
This aligns with my view of our team's ability to assign meaningful priorities. High-level general knowledge about our deployment, errors, and error logging can't substitute for domain expertise. Teams with expertise in a particular codebase are best positioned to understand the impact of a particular message and derive a useful priority.
It would be most helpful if we just had more eyes _routinely_ on the logs and the workboard. (See Tyler's earlier and much more detailed/thoughtful response to this thread.)
+1. An interface connecting the log triage workboard and process to team/maintainer workflows is a missing component of assigning priorities.
There is a long developer feedback loop past integration. Hopefully, this process helps to shorten the feedback loop to developers and reduce the opacity of the process beyond integration through release and monitoring. Having the expertise of developers writing the code be a part of the deployment and monitoring of that code in production is the goal of this process and the key to its utility.
-- Tyler