Howdy all,
Now that the stars have aligned and:
- The Mobile Frontend release with X-Analytics logging the site mode cookie (mf-m) has been out for a bit, and shows up in our logs;
- We've deployed the patch to fix a critical Hadoop bug[1][2] that was blocking the job;
- I've personally "conquered" the wikiplague again (after being out half of last week);
- And finally, the script works and its results look valid and complete
...I'm happy to report the Mobile site sessions job[3] is ready to ship. I'm pretty sure this is the first view of mobile site sessions ever, so I'm pretty excited. I've included some Bonus Stats from my test run, which I generated 'cause I was curious :)
Before we get too excited, though, I've held off on enabling the daily job (as well as backfilling March) because it turns out that a day's worth of data generates about 16GB of sessions. This isn't a problem for the cluster, but daily syncs would pretty rapidly fill up stat1001's public data storage. So to go forward, access to the data would probably have to be provided via private rsync rather than public HTTP. A third option is to work with the data on the cluster itself via any of the available tools; I've been using a SQL tool called Hive to validate various job runs and I can't say I'm missing MySQL. (If people are interested, I'd be happy to go over the options in more detail.)
So, we're looking for guidance on going forward.
- Is the granular session output still the desired result, given the job's size? Currently the job ends by coalescing the data into one giant TSV; instead it could generate a summary, or a selection of stats about the run.
- If so:
  - Is it helpful to backfill March?
  - Does the data need to be publicly accessible via HTTP, or can we explore other options for providing access to the team?
I'm happy to answer any other questions as well.
Thanks!
Dave for Team Analytics
[1] The bug: https://issues.cloudera.org/browse/DISTRO-461
[2] The fix: https://mingle.corp.wikimedia.org/projects/analytics/cards/595
[3] Feature request: https://mingle.corp.wikimedia.org/projects/analytics/cards/240
---
BONUS STATS!
Notes:
- As a reminder, a session (or "visit") is defined as all activity with less than 30 minutes between each hit.
- The test job looked at all requests on 4/21, which is 75.17 GB of request logs.
- It took ~17 minutes to process the day into 15.3 GB of sessions. (It then took 51m44s to concatenate those 28 files into one monstrous TSV for "ease" of delivery to y'all.)
- The summary below took maybe 10 minutes to set up/write in Hive, and the job took maybe 7 minutes.
Visits to Mobile Site, 4/21/2013

- Total Visits: 51,624,103
- Unique Visitors: 37,736,120
- Total Pageviews: 104,972,033
- Avg Pageviews per Session: 2.0334
- Max Pageviews in one Session: 141,882

Standard Site
- Visits: 51,603,221
- Unique Visitors: 37,723,188
- Pageviews: 104,910,382
- Avg Pageviews per Session: 2.033

Alpha Site
- Visits: 986
- Unique Visitors: 822
- Pageviews: 7,087
- Avg Pageviews per Session: 7.188

Beta Site
- Visits: 19,896
- Unique Visitors: 16,235
- Pageviews: 54,564
- Avg Pageviews per Session: 2.742
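For anyone wanting to sanity-check the figures above, the per-mode numbers can be cross-checked against the totals. What follows is only a minimal Python sketch: the dictionary layout and function names are made up for illustration, and only the numbers themselves come from the test run.

```python
# Per-mode figures copied from the 4/21/2013 test run above.
# The structure and names here are illustrative, not part of the job.
MODES = {
    "standard": {"visits": 51_603_221, "pageviews": 104_910_382},
    "alpha": {"visits": 986, "pageviews": 7_087},
    "beta": {"visits": 19_896, "pageviews": 54_564},
}
TOTAL_VISITS = 51_624_103
TOTAL_PAGEVIEWS = 104_972_033


def avg_pageviews(mode):
    """Average pageviews per session for one site mode."""
    return mode["pageviews"] / mode["visits"]


def totals_match():
    """True if per-mode visits and pageviews add up to the reported totals."""
    visits = sum(m["visits"] for m in MODES.values())
    pageviews = sum(m["pageviews"] for m in MODES.values())
    return visits == TOTAL_VISITS and pageviews == TOTAL_PAGEVIEWS


if __name__ == "__main__":
    print("totals match:", totals_match())
    for name, mode in MODES.items():
        print(f"{name}: {avg_pageviews(mode):.3f} pageviews/session")
```

Visits and pageviews add up exactly across the three modes. The per-mode unique visitor counts sum to 37,740,245, slightly more than the reported total of 37,736,120, presumably because a visitor who used more than one mode that day is counted once per mode.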
Those numbers look sane to you guys?
-- David Schoonover dsc@wikimedia.org
On Tue, Apr 23, 2013 at 11:13 AM, Matthew Walker mwalker@wikimedia.org wrote:
Max Pageviews in one Session: 141,882
Whaa!?
Also... does this job run on the main site? Ie: for desktop browsers?
No, this is just for traffic from the cp104* varnish servers (i.e., traffic to the mobile site). D
Yeah, 141,882 pageviews in one 30 minute session seems unlikely for a real human :) Is this picking up a bot or some other script crawling the mobile site? Also, is dynamic content loading mucking with the pageview depth stats for alpha? It seems odd that alpha users see so many more pages per session than beta or stable.
The other stuff does look sane to me, and it's pretty damn exciting to finally see a snapshot of how many users we have on beta/alpha in a given day! What I'm really interested in digging into from this:
1) seeing a histogram of pageview depth broken out by stable/beta/alpha, something like the following:
   % of users who view 1 page per session
   % of users who view 2-3 pages per session
   % ... who view 4-10
   % ... who view more than 10
2) getting the same histograms just for people who view at least one special page in a given session – do they look about the same, or does their distribution skew to the right?
3) getting a dump of all the data from all the aforementioned "special" people (yes, that's what I'm calling them!) into a .tsv to play with
4) getting a dump of the top 100 referrers
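The depth histogram in request 1) could be sketched roughly as below. This is a hedged Python illustration, not the job's code: the `(mode, pageviews)` input shape and all names are hypothetical, and it counts sessions rather than users, so "% of users" would need a per-visitor rollup first.

```python
from collections import Counter, defaultdict

# Bucket edges follow the breakdown proposed above: 1, 2-3, 4-10, more than 10.
BUCKETS = ["1", "2-3", "4-10", ">10"]


def bucket_for(pageviews):
    """Map a session's pageview count to its depth bucket."""
    if pageviews <= 1:
        return "1"
    if pageviews <= 3:
        return "2-3"
    if pageviews <= 10:
        return "4-10"
    return ">10"


def depth_histogram(sessions):
    """Return {mode: {bucket: percent of that mode's sessions}}.

    `sessions` is an iterable of (mode, pageviews) pairs; this input
    shape is hypothetical, not the job's actual TSV layout.
    """
    counts = defaultdict(Counter)
    for mode, pageviews in sessions:
        counts[mode][bucket_for(pageviews)] += 1
    histogram = {}
    for mode, per_bucket in counts.items():
        total = sum(per_bucket.values())
        histogram[mode] = {b: 100.0 * per_bucket[b] / total for b in BUCKETS}
    return histogram
```

Calling `depth_histogram([("beta", 1), ("beta", 2), ("beta", 5), ("beta", 50)])` on that made-up input puts one session in each bucket. The same function would serve request 2) if the input were first restricted to sessions containing a special page.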
It would be helpful but not essential to backfill March, if possible, to see if these trends remain stable over time. As for querying, I'd personally love to be able to do it, so I don't have to beg/distract you every time I have a shiny new request :) Let's chat sometime today?
On Tue, Apr 23, 2013 at 11:13 AM, Matthew Walker mwalker@wikimedia.org wrote:
Max Pageviews in one Session: 141,882
Whaa!?
Also... does this job run on the main site? Ie: for desktop browsers?
~Matt Walker Wikimedia Foundation Fundraising Technology Team
-- Maryana Pinchuk Associate Product Manager, Wikimedia Foundation wikimediafoundation.org
Is there any reason this can't be published to the mobile-l mailing list? Some really interesting data here. It would be good to see trends in stable, beta, alpha. If someone opts out of alpha / beta around the time of a deployment it suggests that either a feature they liked has moved or a new feature they dislike has made them move. These kinds of things would be interesting to monitor.
Good work guys!
On Tue, Apr 23, 2013 at 11:44 AM, Jon Robson jrobson@wikimedia.org wrote:
Is there any reason this can't be published to mobile-l mailing list? Some really interesting data here.
Feel free to forward it to mobile-l, go for it! (I don't think I'm a member of that list.) D
As for querying, I'd personally love to be able to do it, so I don't have to beg/distract you every time I have a shiny new request :) Let's chat sometime today?
Cool. Just grab me today whenever you're free. Say, now?
Yeah, 141,882 pageviews in one 30 minute session
You're a little mixed up. Sessions aren't 30 minute windows. A session is all hits from a given visitor such that each happens no more than 30 minutes after the last. The intuition is that a session ends when the user is idle for 30 minutes. The next hit then begins a new session. Make sense?
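To make the idle-gap definition concrete, here is a minimal Python sketch of that kind of sessionization. It is not the actual job: the `(visitor_id, unix_timestamp)` input shape is an assumption, and so is treating a gap of exactly 30 minutes as still inside the session.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # idle time after which a new session starts


def sessionize(hits):
    """Group hits into sessions per visitor.

    `hits` is an iterable of (visitor_id, unix_timestamp) pairs; both the
    input shape and the visitor_id notion are assumptions for illustration.
    Returns {visitor_id: [[timestamps of session 1], [session 2], ...]}.
    """
    by_visitor = defaultdict(list)
    for visitor_id, timestamp in hits:
        by_visitor[visitor_id].append(timestamp)

    sessions = {}
    for visitor_id, timestamps in by_visitor.items():
        timestamps.sort()
        visitor_sessions = [[timestamps[0]]]
        for previous, current in zip(timestamps, timestamps[1:]):
            if current - previous > SESSION_GAP_SECONDS:
                # Idle for more than 30 minutes: this hit opens a new session.
                visitor_sessions.append([current])
            else:
                visitor_sessions[-1].append(current)
        sessions[visitor_id] = visitor_sessions
    return sessions
```

Four hits spaced 25 minutes apart form a single 75-minute session under this rule, which is why one session can run far longer than 30 minutes and rack up a very large pageview count.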
-- David Schoonover dsc@wikimedia.org
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On Tuesday, April 23, 2013 at 11:13 AM, Matthew Walker wrote:
Max Pageviews in one Session: 141,882
zcat /a/squid/archive/mobile/mobile-sampled-100.tsv.log-20130421.gz | awk '{ print $5 }' | sort | uniq -c | sort -nr | head
   7706 208.80.154.x
   7523 208.80.154.x
   7467 208.80.154.x
   7133 208.80.154.x
(I censored the last octet on the off-chance that it is sensitive.) These are internal IPs. If they haven't been filtered out, they're probably causing the huge page view count.
-- Ori Livneh
I'm running a job to learn more about the sessions with the most pageviews so hopefully the mystery will be solved soon, but afaik the isPageview filter excludes hits that match our CIDR ranges (and it has tests). I'll certainly double-check it, as it's used everywhere. (Also, this dataset comes from the mobile varnishes, not the squids, fwiw.)
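For readers unfamiliar with CIDR-based filtering, a check of that general kind can be sketched in Python with the standard `ipaddress` module. This is not the isPageview filter itself: the function names are hypothetical, and the ranges below are RFC 5737 documentation placeholders, since the real internal ranges are not given in this thread.

```python
import ipaddress

# Placeholder ranges from RFC 5737 (documentation addresses), NOT the real
# internal ranges, which are not given in this thread.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_internal(ip_string):
    """True if the address falls inside any configured internal network."""
    address = ipaddress.ip_address(ip_string)
    return any(address in network for network in INTERNAL_NETWORKS)


def external_hits(hits):
    """Keep only (ip, url) hits whose IP is outside the internal ranges."""
    return [(ip, url) for ip, url in hits if not is_internal(ip)]
```

Swapping in the real range list and running the top IPs from the sampled log through `is_internal` would be one quick way to confirm whether such hits are being excluded.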
-- David Schoonover dsc@wikimedia.org
After talking with Maryana yesterday and showing off Hive, we decided we're going to get her an account on the cluster so she can explore directly against the full dataset.
So, next steps:
- We'll start the process with ops to get her shell access on the kraken machines; history teaches us this can take a while.
- I'll modify the sessions job to drop the mega-tsv step and instead update a Hive table.
- Then I'll enable the daily runs, and kick off a backfill starting March 1.
I'll update y'all when that's done.
-- David Schoonover dsc@wikimedia.org