Hi all,
is there any research on common causes of Wikimedia production errors?
Based on recent examples, I plan to analyze and discuss how production errors could be avoided. I am considering submitting a short paper on this to the Wikidata workshop, which has a submission deadline of Thursday, 20 July 2023 (website: https://wikidataworkshop.github.io/2023/). However, there may be more suitable venues.
I am also open to collaboration on this effort. If you are interested in a joint paper, drop me an email by the end of this week.
All the best, Moritz
I'm in no way an expert in this area, but from what I have seen over the past years, I think I can identify two recurring patterns:
1. Minor programming mistakes in unrelated code. This often happens when we add stricter types to existing code, or make it throw exceptions when it is called in a way it should never have been called, e.g. when a method that expects a string is called with null. Tests can rarely catch such "unthinkable" edge cases beforehand. They bubble up in production, where codebases work together in ways that have never been part of any automated or manual test setup (see the sketch after this list). Luckily, this kind of error is often easy to fix or safe to ignore.
2. Database hiccups. Errors that appear to be "random" and are really hard, if not impossible, to reproduce. Sometimes it turns out the cause is a really, really old database row that was created with very different constraints in mind. More recent code might have a different idea of how a particular database table works nowadays and fails when faced with incompatible data. Or we find that the database schema on certain replica machines is not what it should be, for example foreign keys to tables that should not have existed for 18 years, but somehow still do. ;-) https://phabricator.wikimedia.org/T299387
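To illustrate the first pattern, here is a minimal sketch in Python (purely illustrative, since MediaWiki itself is PHP, and all names below are made up): a stricter check added to one function turns a long-tolerated call from unrelated code into a production error that no test of either side ever exercised.

```python
def format_title(title: str) -> str:
    """Stricter version: rejects the 'unthinkable' None input instead of coercing it."""
    if not isinstance(title, str):
        # The old version silently treated None like an empty string;
        # the strict version throws instead.
        raise TypeError(f"title must be a string, got {type(title).__name__}")
    return title.strip().replace(" ", "_")


def legacy_caller(page_row: dict) -> str:
    # Years-old, unrelated code that never guaranteed the 'title' key exists.
    # Unit tests for format_title() never exercise this call path.
    return format_title(page_row.get("title"))


if __name__ == "__main__":
    print(legacy_caller({"title": "Main Page"}))  # fine
    try:
        legacy_caller({})  # the edge case that only shows up in production
    except TypeError as err:
        print(f"production error: {err}")
```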
Let's say I'm interested, but have no research at hand. :-)
Best, Thiemo
On Thu, Jun 8, 2023 at 12:40 AM Physikerwelt wiki@physikerwelt.de wrote:
Hi all,
is there any research on common causes of Wikimedia production errors?
Based on recent examples, I plan to analyze and discuss how production errors could be avoided. I am considering submitting a short paper on this to the Wikidata workshop, which has a submission deadline of Thursday, 20 July 2023 (website: https://wikidataworkshop.github.io/2023/). However, there may be more suitable venues.
We (Release Engineering) file production-error tasks as part of the weekly train and collect some data in the "train-stats" repo on GitLab[0]. Additionally, Timo Tijhof's "production excellence" blog posts and emails to this list may be of interest to you[1].
The "train-stats" repo collects data for "software defect prediction" based on the use of "FixCaches" or "BugCaches."[2] Each week, we record changes that fix bugs (i.e., the change uses the git trailer `Bug: TXXX` and gets backported to a currently deployed branch). The theory (per the paper linked above) is that the more often a file needs a fix, the more likely it is to cause future bugs. I have an extremely convoluted query to show the list of commonly backported files[3].
Problems with this data:
- Many of these files are frequently touched files vs. error-prone files (e.g., "composer.json")
- Looking at the count of backports for each file means newer files are less likely to be represented
- "Lower level" files may be overrepresented (although that's probably to be expected)
In 2013, a case study used data like this inside Google and found it to be fairly accurate at predicting future bugs[4].
Also, in the case study, whenever a developer edited a file that was present in their FixCache, researchers added a bot-generated note to the patch in their code review tool. Their developers found this note unhelpful: developers already knew these files were problematic, and the warning just caused confusion.
Based on that, in March 2020, we created the "Risky Change Template"[5]. My thinking was: if developers already know what's risky, then they can flag it in the train task for the week[6]. At the time, I hoped this would reduce the total version deployment time (although I have no data on that).
I hope some of this helps!
– Tyler
[0]: https://gitlab.wikimedia.org/repos/releng/train-stats
[1]: https://phabricator.wikimedia.org/phame/post/view/296/production_excellence_...
[2]: https://people.csail.mit.edu/hunkim/images/3/37/Papers_kim_2007_bugcache.pdf
[3]: https://data.releng.team/train?sql=select%0D%0A++filename%2C%0D%0A++project%...
[4]: https://doi.org/10.1109/ICSE.2013.6606583
[5]: https://wikitech.wikimedia.org/wiki/Deployments/Risky_change_template
[6]: https://train-blockers.toolforge.org/ (here's an example from this week: https://phabricator.wikimedia.org/T337526#8901982)
I am also open to collaboration on this effort. If you are interested in a joint paper, drop me an email by the end of this week.
All the best, Moritz
Thank you for your feedback. I think we have a very large and open corpus of documented incidents. As I said, the Wikidata workshop is not a good target venue. Following the reference
[4]: https://doi.org/10.1109/ICSE.2013.6606583
I think a paper on that would be a better fit for https://conf.researchr.org/track/icse-2024/icse-2024-software-engineering-in...
I will continue updating the paper on Overleaf:
https://www.overleaf.com/read/swswtbdyyhmg
All the best, Moritz