Hi all,
Now that we’ve had a little space to analyze the problem, I wanted to call out a recent
webrequest data loss issue that we experienced on two separate occasions.
We attempted to upgrade to Kafka 0.8.2.1, and it wasn’t until the second attempt that we
actually found the problem. Kafka 0.8.2.1 ships with a buggy version of Snappy[1] that
prevents messages from being compressed properly. This caused a ~4x increase in network
and disk I/O across the cluster all at once.
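To give a feel for the mechanism (this is an illustration, not Kafka’s actual code, and it uses Python’s zlib as a stand-in for Snappy): batched log records compress very well because they share structure, but if a broken compressor effectively handles each small message on its own, most of that benefit disappears and bytes on the wire balloon:

```python
import zlib

# 1000 similar small records, loosely shaped like webrequest log lines
records = [b'{"host":"en.wikipedia.org","uri":"/wiki/Page_%d"}' % i
           for i in range(1000)]
total = sum(len(r) for r in records)

# Healthy case: one compression stream over the whole batch
batch = zlib.compress(b"".join(records))

# Failure mode: each record compressed in isolation, so the
# per-message header overhead dominates and redundancy across
# records is never exploited
per_record = sum(len(zlib.compress(r)) for r in records)

print(f"raw: {total}  batched: {len(batch)}  per-record: {per_record}")
```

Running this shows the per-record total landing near (or above) the raw size while the batched stream shrinks dramatically, which is the kind of effective loss of compression that multiplies network and disk I/O.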
We’ve documented the incidents and the occasions of significant data loss here:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150803-Kafka
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150810-Kafka#Conclusions
https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
This loss affects the pagecount* and pageview datasets, as well as other statistics
generated from webrequest data. Please consider statistics generated from webrequest
data during the following UTC hours unreliable:
2015-08-03T18:00 - 2015-08-03T23:00
2015-08-10T15:00 - 2015-08-10T21:00
2015-08-11T17:00 - 2015-08-11T18:00
Many apologies for any inconvenience this causes. We’ve learned a lot during this
turmoil, and have many ideas on how to prevent this from happening in the future, and
on how to reduce loss and complexity if and when it does. The Analytics Engineering
team will be doing a post mortem on this soon, in which we will document these ideas.
Thanks,
-Andrew Otto
[1]
https://issues.apache.org/jira/browse/KAFKA-2189