After talking with Maryana yesterday and showing off Hive, we decided we're going to get her an account on the cluster so she can explore directly against the full dataset.

So, next steps:
- We'll start the process with ops to get her shell access on the kraken machines; history teaches us this can take a while.
- I'll modify the sessions job to drop the mega-tsv step and instead update a Hive table.
- Then I'll enable the daily runs and kick off a backfill starting March 1.

I'll update y'all when that's done.

--
David Schoonover
dsc@wikimedia.org


On Wed, Apr 24, 2013 at 10:47 AM, David Schoonover <dsc@wikimedia.org> wrote:
zcat /a/squid/archive/mobile/mobile-sampled-100.tsv.log-20130421.gz | awk '{ print $5 }' | sort | uniq -c | sort -nr | head
   7706 208.80.154.x
   7523 208.80.154.x
   7467 208.80.154.x
   7133 208.80.154.x

I'm running a job to learn more about the sessions with the most pageviews, so hopefully the mystery will be solved soon. That said, afaik the isPageview filter already excludes hits that match our CIDR ranges (and it has tests). I'll certainly double-check it, since it's used everywhere. (Also, fwiw, this dataset comes from the mobile varnishes, not the squids.)
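
For anyone following along, here is a minimal sketch of the kind of CIDR check being described. It is not the actual isPageview code: the function name `is_internal` and the single range `208.80.154.0/24` are assumptions chosen only because that range covers the censored addresses quoted in this thread.

```python
import ipaddress

# Assumption: illustrative range only. It covers the censored addresses
# quoted in this thread; it is not the real production list of ranges.
INTERNAL_NETWORKS = [ipaddress.ip_network("208.80.154.0/24")]


def is_internal(ip_string):
    """Return True if ip_string falls inside any configured CIDR range."""
    try:
        addr = ipaddress.ip_address(ip_string)
    except ValueError:
        # Unparseable client field: treat it as not internal.
        return False
    # Membership is False when the address and network versions differ.
    return any(addr in net for net in INTERNAL_NETWORKS)
```

A pageview filter built on this would drop any hit where `is_internal(client_ip)` is true before sessions are counted.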

--
David Schoonover


On Tue, Apr 23, 2013 at 2:29 PM, Ori Livneh <ori@wikimedia.org> wrote:



On Tuesday, April 23, 2013 at 11:13 AM, Matthew Walker wrote:

> > Max Pageviews in one Session: 141,882
>

zcat /a/squid/archive/mobile/mobile-sampled-100.tsv.log-20130421.gz | awk '{ print $5 }' | sort | uniq -c | sort -nr | head
   7706 208.80.154.x
   7523 208.80.154.x
   7467 208.80.154.x
   7133 208.80.154.x

(I censored the last octet on the off-chance that it is sensitive.) These are internal IPs. If they haven't been filtered out, they're probably causing the huge page view count.
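
The pipeline above can be approximated in Python, which makes it easy to re-run the count with suspect addresses excluded. This is only a sketch: `top_clients` and its `exclude` parameter are hypothetical names, and it assumes the client IP is the fifth whitespace-separated field, as in the awk command.

```python
import gzip
from collections import Counter


def top_clients(path, field=5, limit=10, exclude=None):
    """Count values of a 1-based, whitespace-separated field in a gzipped log.

    exclude is an optional predicate; values for which it returns True
    are left out of the count.
    """
    counts = Counter()
    with gzip.open(path, "rt", errors="replace") as log:
        for line in log:
            parts = line.split()  # same default splitting as awk
            if len(parts) < field:
                continue  # unlike awk, skip short lines instead of counting ""
            value = parts[field - 1]
            if exclude is not None and exclude(value):
                continue
            counts[value] += 1
    # Most frequent first, like `sort | uniq -c | sort -nr | head`.
    return counts.most_common(limit)
```

Comparing the output with and without an `exclude` predicate for the internal addresses would show how much of the traffic they account for.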


--
Ori Livneh



_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics