After talking with Maryana yesterday and showing off Hive, we decided we're
going to get her an account on the cluster so she can explore directly
against the full dataset.
So, next steps:
- We'll start the process with ops to get her shell access on the kraken
machines; history teaches us this can take a while.
- I'll modify the sessions job to drop the mega-tsv step and instead update
a Hive table.
- Then I'll enable the daily runs, and kick off a backfill starting March 1.
I'll update y'all when that's done.
--
David Schoonover
dsc(a)wikimedia.org
On Wed, Apr 24, 2013 at 10:47 AM, David Schoonover <dsc(a)wikimedia.org> wrote:
> zcat /a/squid/archive/mobile/mobile-sampled-100.tsv.log-20130421.gz | awk '{ print $5 }' | sort | uniq -c | sort -nr | head
> 7706 208.80.154.x
> 7523 208.80.154.x
> 7467 208.80.154.x
> 7133 208.80.154.x

I'm running a job to learn more about the sessions with the most pageviews
so hopefully the mystery will be solved soon, but afaik the isPageview
filter excludes hits that match our CIDR ranges (and it has tests). I'll
certainly double-check it, as it's used everywhere. (Also, this dataset
comes from the mobile varnishes, not the squids, fwiw.)
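
As a quick cross-check from the shell, here is a rough sketch of the same top-IP count with internal addresses dropped first. It assumes the internal range is 208.80.152.0/22 (a guess based on the addresses quoted above, not the actual list used by the isPageview filter), and the function name top_external_ips is made up for illustration.

```shell
# Hypothetical helper: count hits per client IP (field 5, as in the quoted
# pipeline) after skipping an ASSUMED internal range, 208.80.152.0/22,
# i.e. 208.80.152.* through 208.80.155.*.
top_external_ips() {
  awk '{
    split($5, o, ".")
    internal = (o[1] == 208 && o[2] == 80 && o[3] >= 152 && o[3] <= 155)
    if (!internal) print $5
  }' | sort | uniq -c | sort -nr | head
}
```

Usage would mirror the original command: zcat the sampled log and pipe it into top_external_ips. If the top of the list still shows counts in the thousands, the internal IPs are probably not the whole story.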
--
David Schoonover
dsc(a)wikimedia.org
On Tue, Apr 23, 2013 at 2:29 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
>
>
>
> On Tuesday, April 23, 2013 at 11:13 AM, Matthew Walker wrote:
>
> > > Max Pageviews in one Session: 141,882
> >
>
> zcat /a/squid/archive/mobile/mobile-sampled-100.tsv.log-20130421.gz | awk '{ print $5 }' | sort | uniq -c | sort -nr | head
> 7706 208.80.154.x
> 7523 208.80.154.x
> 7467 208.80.154.x
> 7133 208.80.154.x
>
> (I censored the last octet on the off-chance that it is sensitive.) These
> are internal IPs. If they haven't been filtered out, they're probably
> causing the huge page view count.
>
>
> --
> Ori Livneh
>
>
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>