[Wikitech-l] Re: Research on Wikimedia Production Errors

8 Jun 2023

I'm in no way an expert in this area. But from what I have seen over
the past years, I think I can identify two recurring patterns:

1. Minor programming mistakes in unrelated code. This happens often
when we add stricter types to existing code, or make it throw
exceptions when it's called in a way it should never have been called,
e.g. when a method that expects a string is called with null. Tests
can rarely catch such "unthinkable" edge cases beforehand. They bubble
up in production, where codebases work together in ways that have never
been part of any automated or manual test setup. Luckily this kind of
error is often easy to fix or safe to ignore.

2. Database hiccups. Errors that appear to be "random" and are really
hard, if not impossible, to reproduce. Sometimes it turns out the
reason is a really, really old database row that was created with very
different constraints in mind. More recent code might have a different
idea of how a particular database table works nowadays and fails when
faced with incompatible data. Or we find that the database schema on
certain replication machines is not what it should be. For example,
foreign keys to tables that shouldn't have existed for 18 years,
but somehow still do. ;-) https://phabricator.wikimedia.org/T299387
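
To make the first pattern a bit more concrete, here is a minimal
sketch in Python (not MediaWiki code; all function names and the row
layout are made up for illustration). It assumes a helper that used to
tolerate a missing value and is later made strict, plus a caller in
some other codebase that reads from an old row without the expected
field:

```python
from typing import Callable, Optional


def normalize_title_lenient(title: Optional[str]) -> str:
    """Old behaviour (hypothetical): a missing title becomes an empty string."""
    return (title or "").strip().replace(" ", "_")


def normalize_title_strict(title: str) -> str:
    """New behaviour (hypothetical): anything that is not a string is rejected."""
    if not isinstance(title, str):
        raise TypeError(f"title must be str, got {type(title).__name__}")
    return title.strip().replace(" ", "_")


def legacy_caller(row: dict, normalize: Callable) -> str:
    # Hypothetical caller in an unrelated codebase. Very old rows may
    # lack the "title" key, in which case .get() returns None.
    return normalize(row.get("title"))
```

A test that only feeds strings to the strict version passes, while
`legacy_caller({"id": 1}, normalize_title_strict)` raises a TypeError
that the lenient version would have hidden. That is roughly the
situation that only shows up in production.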

Let's say I'm interested, but have no research at hand. :-)

Best
Thiemo
