Tilman - done, but apologies for the unhelpful link formatting in that tooltip.  I'll file a Phabricator bug to improve that.  By the way, annotations for the pageview data can be collaboratively edited: https://meta.wikimedia.org/wiki/Dashiki:PageviewsAnnotations (unlocked for now; we'll limit access if we start having problems).

On Wed, Aug 26, 2015 at 6:22 PM, Tilman Bayer <tbayer@wikimedia.org> wrote:
Thanks for the update! And BTW kudos also for marking these as
annotations in the dashboard at https://vital-signs.wmflabs.org/
(maybe link the incident reports from there as well?)

On Wed, Aug 26, 2015 at 1:26 PM, Andrew Otto <aotto@wikimedia.org> wrote:
> Hi all,
>
> Now that we’ve had a little space to analyze the problem, I wanted to call
> out a recent webrequest data loss issue that we experienced on two separate
> occasions.
>
> We attempted to upgrade to Kafka 0.8.2.1, and it wasn’t until the second
> attempt that we actually found the problem.  Kafka 0.8.2.1 ships with a
> buggy version of Snappy[1] that causes messages to not be compressed
> properly.  This caused a ~4x increase in network and disk I/O around the
> cluster all at once.
>
> We’ve documented the incidents and the occasions of significant data loss
> here:
>
> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150803-Kafka
>
> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150810-Kafka#Conclusions
>
> https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
>
> This loss will affect the output of pagecount* and pageview datasets, as
> well as other webrequest generated statistics.  Please consider statistics
> that are generated from webrequest data using the following UTC hours
> unreliable:
>
>   2015-08-03T18:00 - 2015-08-03T23:00
>   2015-08-10T15:00 - 2015-08-10T21:00
>   2015-08-11T17:00 - 2015-08-11T18:00
>
> Many apologies for any inconvenience this causes.  We’ve learned a lot
> during this turmoil, and have a lot of ideas on how to hopefully prevent
> this from happening in the future, and also how to reduce loss and
> complexity if and when it does.  The analytics engineering team will be
> doing a post mortem on this soon, in which we will document these ideas.
>
> Thanks,
> -Andrew Otto
>
> [1] https://issues.apache.org/jira/browse/KAFKA-2189
>
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
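
For anyone filtering hourly webrequest-derived data, the unreliable UTC windows listed in the message above could be applied roughly as follows. This is only a sketch: the names UNRELIABLE_WINDOWS and is_unreliable are made up for illustration, and it assumes the closing hour of each range is itself affected (the announcement does not say whether the end is inclusive), so it errs on the side of excluding more.

```python
from datetime import datetime, timezone

# Windows copied from the announcement (UTC).
# ASSUMPTION: the closing hour of each range is treated as affected too.
UNRELIABLE_WINDOWS = [
    ("2015-08-03T18:00", "2015-08-03T23:00"),
    ("2015-08-10T15:00", "2015-08-10T21:00"),
    ("2015-08-11T17:00", "2015-08-11T18:00"),
]


def _parse(ts):
    """Parse a 'YYYY-MM-DDTHH:MM' string as a UTC-aware datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M").replace(tzinfo=timezone.utc)


def is_unreliable(moment):
    """Return True if the hour containing `moment` (timezone-aware) is in a listed window."""
    # Truncate to the start of the hour, in UTC, before comparing.
    hour = moment.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    return any(_parse(start) <= hour <= _parse(end) for start, end in UNRELIABLE_WINDOWS)
```

Hourly rows whose timestamp makes is_unreliable return True could then be dropped or flagged before computing aggregates.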



--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
