Forwarding to mobile-l in case anyone is interested. Analytics for
mobile including statistics around alpha and beta usage.
---------- Forwarded message ----------
From: David Schoonover <dsc(a)wikimedia.org>
Date: Tue, Apr 23, 2013 at 2:45 AM
Subject: Mobile Site Session Data (with Bonus Analytics!)
To: mobile-tech <mobile-tech(a)wikimedia.org>, Analytics List
Cc: Diederik van Liere <dvanliere(a)wikimedia.org>, Maryana Pinchuk
<mpinchuk(a)wikimedia.org>, Arthur Richards <arichards(a)wikimedia.org>
Now that the stars have aligned and:
- The Mobile Frontend release with X-Analytics logging the site mode
cookie (mf-m) has been out for a bit, and shows up in our logs;
- We've deployed the patch to fix a critical Hadoop bug that was
blocking the job;
- I've personally "conquered" the wikiplague again (after being out
for half of last week);
- And finally, the script works and its results look valid and complete
...I'm happy to report the Mobile site sessions job is ready to
ship. I'm pretty sure this is the first view of mobile site sessions
ever, so I'm pretty excited. I've included some Bonus Stats from my
test run which I generated 'cause I was curious :)
Before we get too excited, though, I've held off on enabling the daily
job (as well as backfilling March) because it turns out that a day's
worth of data generates about 16GB worth of sessions. This isn't a
problem for the cluster, but we'd pretty rapidly compromise stat1001's
public data storage with daily syncs. So to go forward, access to the
data would probably have to be provided via private rsync. A third
option is to work with the data on the cluster itself via any of the
available tools; I've been using a SQL tool called Hive to validate
various job runs and I can't say I'm missing MySQL. (If people are
interested, I'd be happy to go over the options in more detail.)
So, we're looking for guidance on going forward.
- Is the granular session output still the desired result, given the
job's size? Currently the job ends by coalescing the data into one giant
TSV; instead it could generate a summary, or a selection of stats
about the run.
- If so:
- Is it helpful to backfill March?
- Does the data need to be publicly accessible via HTTP, or can we
explore other options for providing access to the team?
I'm happy to answer any other questions as well.
Dave for Team Analytics
 The bug: https://issues.cloudera.org/browse/DISTRO-461
 The fix: https://mingle.corp.wikimedia.org/projects/analytics/cards/595
 Feature request:
- As a reminder, a session (or "visit") is defined as all activity
with less than 30 minutes between each hit.
- The test job looked at all requests on 4/21, which is 75.17 GB of
- It took ~17 minutes to process the day into 15.3 GB of sessions. (It
then took 51m44s to concatenate those 28 files into one monstrous TSV
for "ease" of delivery to y'all.)
- The summary below took maybe 10 minutes to set up/write in Hive, and
the job took maybe 7 minutes.
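For anyone who wants to see the session definition above in concrete terms, here is a minimal Python sketch. It is not the Hadoop job itself: the `sessionize` function, the `(visitor_id, timestamp)` input shape, and the in-memory sort are all assumptions made for illustration. It only shows the rule that a gap of 30 minutes or more between consecutive hits from the same visitor starts a new session.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

# Hits separated by less than this gap belong to the same session.
SESSION_GAP = timedelta(minutes=30)


def sessionize(hits):
    """Group (visitor_id, timestamp) hits into sessions.

    Hypothetical helper: returns a list of (visitor_id, [timestamps]),
    one entry per session, ordered by visitor and then by time.
    """
    sessions = []
    ordered = sorted(hits, key=itemgetter(0, 1))
    for visitor, visitor_hits in groupby(ordered, key=itemgetter(0)):
        current = []
        for _, ts in visitor_hits:
            # A gap of 30 minutes or more closes the current session.
            if current and ts - current[-1] >= SESSION_GAP:
                sessions.append((visitor, current))
                current = []
            current.append(ts)
        sessions.append((visitor, current))
    return sessions


if __name__ == "__main__":
    t0 = datetime(2013, 4, 21, 12, 0)
    demo = [
        ("a", t0),
        ("a", t0 + timedelta(minutes=10)),
        ("a", t0 + timedelta(minutes=45)),
        ("b", t0),
    ]
    for visitor, stamps in sessionize(demo):
        print(visitor, len(stamps), "pageview(s)")
```

In the demo data, visitor "a" ends up with two sessions (the 35-minute gap splits them) and visitor "b" with one.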
Visits to Mobile Site, 4/21/2013
- Total Visits: 51,624,103
- Unique Visitors: 37,736,120
- Total Pageviews: 104,972,033
- Avg Pageviews per Session: 2.0334
- Max Pageviews in one Session: 141,882
- Visits: 51,603,221
- Unique Visitors: 37,723,188
- Pageviews: 104,910,382
- Avg Pageviews per Session: 2.033
- Visits: 986
- Unique Visitors: 822
- Pageviews: 7,087
- Avg Pageviews per Session: 7.188
- Visits: 19,896
- Unique Visitors: 16,235
- Pageviews: 54,564
- Avg Pageviews per Session: 2.742
Those numbers look sane to you guys?
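One quick way to sanity-check the figures above is to recompute the averages and compare the three sub-groups against the totals. The sketch below does that in Python using only the numbers quoted in this message; the names `group_1` to `group_3` are placeholders, since the labels for the sub-groups are not shown above.

```python
# Figures copied from the summary above; group names are placeholders.
total = {"visits": 51_624_103, "visitors": 37_736_120, "pageviews": 104_972_033}
groups = {
    "group_1": {"visits": 51_603_221, "visitors": 37_723_188, "pageviews": 104_910_382},
    "group_2": {"visits": 986, "visitors": 822, "pageviews": 7_087},
    "group_3": {"visits": 19_896, "visitors": 16_235, "pageviews": 54_564},
}


def avg_pageviews(stats):
    """Average pageviews per session for one row of the summary."""
    return stats["pageviews"] / stats["visits"]


def column_sum(key):
    """Sum one column across the three sub-groups."""
    return sum(g[key] for g in groups.values())


if __name__ == "__main__":
    print("total avg:", round(avg_pageviews(total), 4))
    for name, stats in groups.items():
        print(name, "avg:", round(avg_pageviews(stats), 3))
    for key in ("visits", "visitors", "pageviews"):
        print(key, "sub-group sum minus total:", column_sum(key) - total[key])
```

Visits and pageviews in the three sub-groups add up exactly to the totals. Unique visitors add up to 4,125 more than the total, which would be consistent with some visitors showing up in more than one sub-group during the day, though that is an inference from the arithmetic and not something the job output states.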