New subject: [Ops] udp2log shutdown (for analytics instances) next week

21 Apr 2015

Hi all!

Now that all data generated by udp2log is also being generated by the Analytics
Cluster, we are finally ready to turn off the analytics udp2log instances.  I will start
with the ones that generate the logs on stat1002 at /a/squid/archive.  The
(identical) cluster-generated logs can be found on stat1002 at /a/log/webrequest/archive.
At the bottom of this email I have pasted the contents of the README file in
/a/squid/archive, which describes the differences.

If you use any of the logs in /a/squid/archive for regular statistics, you will need to
switch your code to use files in /a/log/webrequest/archive instead.  I plan to start
turning off udp2log instances on Monday, April 27th (that’s next week!).
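
If your code stitched the old files together to get a full calendar day, the
following minimal Python sketch may help when adapting it.  It only illustrates
the coverage difference described in the README below.  The 06:30 cutoff is
approximate (the README says ~06:30), and the function names are hypothetical,
not part of any existing tool.

```python
from datetime import date, datetime, time, timedelta

# Approximate cutoff of the old udp2log files (assumption: the README says ~06:30).
OLD_CUTOFF = time(6, 30)


def old_file_window(day):
    """Approximate span of the old file for `day`: ~06:30 previous day to ~06:30 `day`."""
    end = datetime.combine(day, OLD_CUTOFF)
    return end - timedelta(days=1), end


def new_file_window(day):
    """Span of the Hive-generated file for `day`: midnight to midnight."""
    start = datetime.combine(day, time.min)
    return start, start + timedelta(days=1)


def old_files_needed(day):
    """Old-file days that had to be stitched to cover the calendar day `day`.

    The file for `day` holds 00:00 to ~06:30, and the file for the next day
    holds ~06:30 to 24:00.
    """
    return [day, day + timedelta(days=1)]


if __name__ == "__main__":
    d = date(2015, 4, 21)
    print("old file window:", old_file_window(d))
    print("new file window:", new_file_window(d))
    print("old files needed for one day:", old_files_needed(d))
```

With the new files, one file per calendar day is enough, so the stitching step
can be dropped entirely.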

From the README:

[@stat1002:/a/squid/archive] $ cat README.migrate-to-hive.2015-02-17
***********************************************************************
*                                                                     *
*  This directory will go stale once udp2log gets turned off.         *
*  Please use the corresponding TSVs from /a/log/webrequest/archive/  *
*  instead.                                                           *
*                                                                     *
***********************************************************************

The TSV files in this directory underneath /a/squid/archive get
generated by udp2log and suffer from

* Sub-par data quality (e.g. udp2log had an inherent loss).
* Lack of a way to backfill/fix data.
* Some files consuming https requests twice, which made filtering
  necessary.
* Confusing naming scheme, where each file covered 24 hours, not from
  midnight to midnight but from ~06:30 of the previous day to ~06:30
  of the current day.

The new TSVs at /a/log/webrequest/archive/ contain the same
information but get generated by Hive, and address the above four
issues:

* By using Hive's webrequest table as input, the inherent loss is
  gone. Statistics on each hour's data quality are also available.
* Data in Hive can be backfilled/fixed.
* Only data from the varnishes gets picked up. So https traffic no
  longer gets duplicated.
* The files now cover 24 hours from midnight to midnight. No more
  stitching/cutting is needed to get the logs for a given day.

Please migrate to using the Hive-generated TSVs from

  /a/log/webrequest/archive/

Thanks!  I’ll keep you updated as this happens.

-Andrew Otto