Howdy all,

Now that the stars have aligned and:
- The Mobile Frontend release with X-Analytics logging the site mode cookie (mf-m) has been out for a bit, and shows up in our logs;
- We've deployed the patch to fix a critical Hadoop bug[1][2] that was blocking the job;
- I've personally "conquered" the wikiplague again (after being out half of last week);
- And finally, the script works and its results look valid and complete

...I'm happy to report the Mobile site sessions job[3] is ready to ship. I'm pretty sure this is the first view of mobile site sessions ever, so I'm pretty excited. I've included some Bonus Stats from my test run, which I generated 'cause I was curious :)

Before we get too excited, though, I've held off on enabling the daily job (as well as backfilling March) because it turns out that a day's worth of data generates about 16 GB worth of sessions. This isn't a problem for the cluster, but daily syncs would pretty rapidly fill up stat1001's public data storage. So to go forward, access to the data would probably have to be provided via private rsync rather than public HTTP. A third option is to work with the data on the cluster itself via any of the available tools; I've been using a SQL tool called Hive to validate various job runs and I can't say I'm missing MySQL. (If people are interested, I'd be happy to go over the options in more detail.)

So, we're looking for guidance on going forward.
- Is the granular session output still the desired result, given the job's size? Currently the job ends by coalescing the data into one giant TSV; instead it could generate a summary, or a selection of stats about the run.
- If so:
  - Is it helpful to backfill March?
  - Does the data need to be publicly accessible via HTTP, or can we explore other options for providing access to the team?

I'm happy to answer any other questions as well.

Thanks!


Dave for Team Analytics


[1] The bug: https://issues.cloudera.org/browse/DISTRO-461
[2] The fix: https://mingle.corp.wikimedia.org/projects/analytics/cards/595
[3] Feature request: https://mingle.corp.wikimedia.org/projects/analytics/cards/240

---

BONUS STATS!

Notes:
- As a reminder, a session (or "visit") is defined as all activity from one visitor with less than 30 minutes between consecutive hits.
- The test job looked at all requests on 4/21, which is 75.17 GB of request logs.
- It took ~17 minutes to process the day into 15.3 GB of sessions. (It then took 51m44s to concatenate those 28 files into one monstrous TSV for "ease" of delivery to y'all.)
- The summary below took maybe 10 minutes to set up/write in Hive, and the job took maybe 7 minutes.
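
For anyone who wants the session definition spelled out, here's a minimal Python sketch of the 30-minute rule. To be clear, this is not the actual job code: the function name, the `(visitor_id, timestamp)` input shape, and the choice to start a new session at exactly 30 minutes are all assumptions I made up for illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

# From the definition above: hits less than 30 minutes apart share a session.
SESSION_GAP = timedelta(minutes=30)


def sessionize(hits):
    """Group (visitor_id, timestamp) hits into sessions.

    Hypothetical helper: returns a list of (visitor_id, [timestamps]),
    one entry per session. A new session starts whenever the gap since
    the same visitor's previous hit is 30 minutes or more.
    """
    sessions = []
    # Sort by visitor, then by time, so each visitor's hits are contiguous.
    ordered = sorted(hits, key=itemgetter(0, 1))
    for visitor, group in groupby(ordered, key=itemgetter(0)):
        current = []
        for _, ts in group:
            if current and ts - current[-1] >= SESSION_GAP:
                sessions.append((visitor, current))
                current = []
            current.append(ts)
        sessions.append((visitor, current))
    return sessions


# Toy input: visitor "a" has a 35-minute gap before the third hit.
t0 = datetime(2013, 4, 21, 12, 0)
example_hits = [
    ("a", t0),
    ("a", t0 + timedelta(minutes=10)),
    ("a", t0 + timedelta(minutes=45)),
    ("b", t0 + timedelta(minutes=5)),
]
example_sessions = sessionize(example_hits)
```

With the toy input, visitor "a" ends up with two sessions (two hits, then one hit) and visitor "b" with one.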


Visits to Mobile Site, 4/21/2013

- Total Visits: 51,624,103
- Unique Visitors: 37,736,120
- Total Pageviews: 104,972,033
- Avg Pageviews per Session: 2.0334
- Max Pageviews in one Session: 141,882

Standard Site
- Visits: 51,603,221
- Unique Visitors: 37,723,188
- Pageviews: 104,910,382
- Avg Pageviews per Session: 2.033

Alpha Site
- Visits: 986
- Unique Visitors: 822
- Pageviews: 7,087
- Avg Pageviews per Session: 7.188

Beta Site
- Visits: 19,896
- Unique Visitors: 16,235
- Pageviews: 54,564
- Avg Pageviews per Session: 2.742
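
As a quick consistency check on the figures above, here's a small Python sketch that confirms the per-site visits and pageviews add up to the totals and recomputes the per-site averages. The numbers are copied straight from this summary; the dict layout and helper names are just for illustration.

```python
# Figures copied from the 4/21/2013 summary above.
total = {"visits": 51_624_103, "pageviews": 104_972_033}
sites = {
    "standard": {"visits": 51_603_221, "pageviews": 104_910_382},
    "alpha": {"visits": 986, "pageviews": 7_087},
    "beta": {"visits": 19_896, "pageviews": 54_564},
}


def sums_match(total, sites, field):
    """True if the per-site values for `field` add up to the overall total."""
    return sum(s[field] for s in sites.values()) == total[field]


def avg_pageviews(stats):
    """Average pageviews per session for one set of stats."""
    return stats["pageviews"] / stats["visits"]


visits_ok = sums_match(total, sites, "visits")
pageviews_ok = sums_match(total, sites, "pageviews")
overall_avg = round(avg_pageviews(total), 4)
averages = {name: round(avg_pageviews(s), 3) for name, s in sites.items()}
```

Unique visitors are left out on purpose: the per-site counts sum to slightly more than the overall figure, which is what you'd expect if some visitors showed up on more than one site mode during the day.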


Those numbers look sane to you guys?


--
David Schoonover
dsc@wikimedia.org