Hi Analytics Fellows,

TL;DR:

Yesterday we broke, and then fixed, the Hive wmf.webrequest table.

Jobs not monitored by the Analytics team might have failed. Please check your logs :)

Long story:

Yesterday at 9am UTC we deployed a change to the Hive wmf.webrequest table that broke some of its functionality. More precisely, queries that needed to read the table's Parquet columns in detail would fail with a Hive internal error.

The problem went unnoticed for a few hours, since most of our complex computation jobs run only at night. We became aware of it at around 18:00 UTC (kudos @bearloga!).

We quickly fixed the issue and reran the affected jobs over the problematic period.

Given the way the affected jobs failed, we are sure that there has been no data corruption: the jobs failed before they even started to compute anything. For the production jobs we monitor, we know which ones failed and we have taken care of them. However, for jobs that are not monitored (report-updater, manual scripts, etc.), some silent failures might have occurred. Please check your logs :)

Cheers

Joseph Allemandou

Data Engineer @ Wikimedia Foundation

IRC: joal