Forwarding to mobile-l in case anyone is interested: analytics for the mobile site, including statistics on alpha and beta usage.
---------- Forwarded message ----------
From: David Schoonover dsc@wikimedia.org
Date: Tue, Apr 23, 2013 at 2:45 AM
Subject: Mobile Site Session Data (with Bonus Analytics!)
To: mobile-tech mobile-tech@wikimedia.org, Analytics List analytics@lists.wikimedia.org
Cc: Diederik van Liere dvanliere@wikimedia.org, Maryana Pinchuk mpinchuk@wikimedia.org, Arthur Richards arichards@wikimedia.org
Howdy all,
Now that the stars have aligned and:
- The Mobile Frontend release with X-Analytics logging the site mode cookie (mf-m) has been out for a bit, and shows up in our logs;
- We've deployed the patch to fix a critical Hadoop bug[1][2] that was blocking the job
- I've personally "conquered" the wikiplague again (after being out half last week)
- And finally, the script works and its results look valid and complete
...I'm happy to report the Mobile site sessions job[3] is ready to ship. I'm pretty sure this is the first view of mobile site sessions ever, so I was pretty excited. I've included some Bonus Stats from my test run which I generated 'cause I was curious :)
Before we get too excited, though, I've held off on enabling the daily job (as well as backfilling March) because it turns out that a day's worth of data generates about 16 GB worth of sessions. This isn't a problem for the cluster, but we'd pretty rapidly fill up stat1001's public data storage with daily syncs. So to go forward, access to the data would probably have to be provided via private rsync. Another option is to work with the data on the cluster itself via any of the available tools; I've been using a SQL tool called Hive to validate various job runs and I can't say I'm missing MySQL. (If people are interested, I'd be happy to go over the options in more detail.)
So, we're looking for guidance on going forward.
- Is the granular session output still the desired result, given the job's size? Currently the job ends by coalescing the data into one giant TSV; instead it could generate a summary, or a selection of stats about the run.
- If so:
  - Is it helpful to backfill March?
  - Does the data need to be publicly accessible via HTTP, or can we explore other options for providing access to the team?
I'm happy to answer any other questions as well.
Thanks!
Dave for Team Analytics
[1] The bug: https://issues.cloudera.org/browse/DISTRO-461
[2] The fix: https://mingle.corp.wikimedia.org/projects/analytics/cards/595
[3] Feature request: https://mingle.corp.wikimedia.org/projects/analytics/cards/240
---
BONUS STATS!
Notes:
- As a reminder, a session (or "visit") is defined as all activity with less than 30 minutes between each hit.
- The test job looked at all requests on 4/21, which is 75.17 GB of request logs.
- It took ~17 minutes to process the day into 15.3 GB of sessions. (It then took 51m44s to concatenate those 28 files into one monstrous TSV for "ease" of delivery to y'all.)
- The summary below took maybe 10 minutes to set up/write in Hive, and the job took maybe 7 minutes.
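For anyone who wants the session definition above spelled out, here is a minimal Python sketch of the 30-minute rule. It is not the actual job: the `sessionize` function, the `(visitor_id, timestamp)` input shape, and the use of epoch seconds are all assumptions made for illustration, and how the real job identifies a visitor isn't covered in this message.

```python
from collections import defaultdict

# A hit stays in the current session only if it arrives less than
# 30 minutes after the previous hit from the same visitor.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(hits):
    """Group hits into sessions per visitor.

    `hits` is an iterable of (visitor_id, timestamp) pairs, with
    timestamps in epoch seconds (hypothetical input shape).
    Returns {visitor_id: [[ts, ts, ...], [ts, ...], ...]}.
    """
    by_visitor = defaultdict(list)
    for visitor_id, ts in hits:
        by_visitor[visitor_id].append(ts)

    sessions = {}
    for visitor_id, stamps in by_visitor.items():
        stamps.sort()
        visitor_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            previous = visitor_sessions[-1][-1]
            if ts - previous < SESSION_GAP_SECONDS:
                visitor_sessions[-1].append(ts)  # same session
            else:
                visitor_sessions.append([ts])    # gap too long: new session
        sessions[visitor_id] = visitor_sessions
    return sessions


example_hits = [("a", 0), ("a", 600), ("a", 2400), ("b", 100)]
example_sessions = sessionize(example_hits)
```

In this toy input, visitor "a" has two hits ten minutes apart and a third exactly 30 minutes after the second, so the third hit starts a new session under the "less than 30 minutes" wording.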
Visits to Mobile Site, 4/21/2013
- Total Visits: 51,624,103
- Unique Visitors: 37,736,120
- Total Pageviews: 104,972,033
- Avg Pageviews per Session: 2.0334
- Max Pageviews in one Session: 141,882

Standard Site
- Visits: 51,603,221
- Unique Visitors: 37,723,188
- Pageviews: 104,910,382
- Avg Pageviews per Session: 2.033

Alpha Site
- Visits: 986
- Unique Visitors: 822
- Pageviews: 7,087
- Avg Pageviews per Session: 7.188

Beta Site
- Visits: 19,896
- Unique Visitors: 16,235
- Pageviews: 54,564
- Avg Pageviews per Session: 2.742
Those numbers look sane to you guys?
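One quick way to sanity-check the figures above is to confirm that the per-mode numbers add up to the totals and that the averages follow from pageviews divided by visits. The sketch below uses only the numbers in this message; the dictionary layout and variable names are made up for illustration.

```python
# Per-mode figures copied from the summary above.
stats = {
    "standard": {"visits": 51_603_221, "uniques": 37_723_188, "pageviews": 104_910_382},
    "alpha": {"visits": 986, "uniques": 822, "pageviews": 7_087},
    "beta": {"visits": 19_896, "uniques": 16_235, "pageviews": 54_564},
}
TOTAL_VISITS = 51_624_103
TOTAL_UNIQUES = 37_736_120
TOTAL_PAGEVIEWS = 104_972_033

summed_visits = sum(mode["visits"] for mode in stats.values())
summed_pageviews = sum(mode["pageviews"] for mode in stats.values())
summed_uniques = sum(mode["uniques"] for mode in stats.values())

# Average pageviews per session, recomputed per mode and overall.
averages = {name: mode["pageviews"] / mode["visits"] for name, mode in stats.items()}
overall_average = TOTAL_PAGEVIEWS / TOTAL_VISITS

# Per-mode unique visitors in excess of the overall unique count.
unique_overlap = summed_uniques - TOTAL_UNIQUES

print(summed_visits == TOTAL_VISITS, summed_pageviews == TOTAL_PAGEVIEWS)
print({name: round(avg, 3) for name, avg in averages.items()}, round(overall_average, 4))
print(unique_overlap)
```

Visits and pageviews sum exactly to the totals, and the recomputed averages match the reported ones. The per-mode unique visitor counts sum to 4,125 more than the overall figure, which would be consistent with some visitors showing up in more than one site mode, though that reading is an assumption rather than something this message states.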
-- David Schoonover dsc@wikimedia.org
-- Jon Robson http://jonrobson.me.uk @rakugojon