Hi again!
Today I turned off most udp2log webrequest filters. For now, I have left the
Fundraising filters, as well as the 5xx and sampled-1000 filters, running. All of
these remaining filters are now running on erbium.
oxygen's udp2log instance has been shut off.
Instead of constantly updating this thread, I will track further progress in this
Phabricator task:
https://phabricator.wikimedia.org/T97294
Thanks!
On Tue, Apr 21, 2015 at 3:49 PM, Andrew Otto <aotto(a)wikimedia.org> wrote:
Hi all!
Now that all data that is generated by udp2log is also being generated by the
Analytics Cluster, we are finally ready
to turn off analytics udp2log instances. I will start with the ones that are used
to generate the logs on stat1002 at
/a/squid/archive. The (identical) cluster generated logs can be found on stat1002
at /a/log/webrequest/archive. I
will paste the contents of the README file in /a/squid/archive describing the
differences at the bottom of this email.
If you use any of the logs in /a/squid/archive for regular statistics, you will
need to switch your code to use files
in /a/log/webrequest/archive instead. I plan to start turning off udp2log
instances on Monday April 27th (that’s next
week!).
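For scripts that hard-code the old location, the switch can be as small as rewriting the directory prefix. Here is a minimal Python sketch of that idea. The helper name migrate_path and the example file name are hypothetical, and the sketch assumes a file keeps the same relative path under the new directory, which you should verify against the actual listing in /a/log/webrequest/archive before relying on it.

```python
from pathlib import PurePosixPath

# Directory roots taken from this email.
OLD_ROOT = PurePosixPath("/a/squid/archive")
NEW_ROOT = PurePosixPath("/a/log/webrequest/archive")


def migrate_path(path: str) -> str:
    """Map a path under the old udp2log archive to the new archive.

    Assumption (not confirmed here): the relative layout below both
    roots is the same. Paths outside OLD_ROOT are returned unchanged.
    """
    p = PurePosixPath(path)
    try:
        relative = p.relative_to(OLD_ROOT)
    except ValueError:
        # Not under the old archive, so there is nothing to rewrite.
        return path
    return str(NEW_ROOT / relative)


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(migrate_path("/a/squid/archive/example/example-20150427.tsv.gz"))
```

Paths that do not live under /a/squid/archive pass through untouched, so the helper can be applied to a whole list of input files without special-casing.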
From the README:
[@stat1002:/a/squid/archive] $ cat README.migrate-to-hive.2015-02-17
***********************************************************************
* *
* This directory will run stale once udp2log will get turned off. *
* Please use the corresponding TSVs from /a/log/webrequest/archive/ *
* instead. *
* *
***********************************************************************
The TSV files in this directory underneath /a/squid/archive get
generated by udp2log and suffer from
* Sub-par data quality (E.g.: udp2log had an inherent loss).
* Lack of a way to backfill/fix data.
* Some files consuming https requests twice, which made filtering
necessary.
* Confusing naming scheme, where each file covered 24 hours, but not
midnight to midnight, but ~06:30 previous day to ~06:30 current day.
The new TSVs at /a/log/webrequest/archive/ contain the same
information but get generated by Hive, and address the above four
issues:
* By using Hive's webrequest table as input, the inherent loss is
gone. Statistics on each hour's data quality are also available.
* Hive data can be backfilled and fixed.
* Only data from the varnishes gets picked up. So https traffic no
longer gets duplicated.
* The files now cover 24 hours from midnight to midnight. No more
stitching/cutting is needed to get the logs for a given day.
Please migrate to using the Hive-generated TSVs from
/a/log/webrequest/archive/
Thanks! I’ll keep you updated as this happens.
-Andrew Otto