Hi all,
as you might know, I have a few GLAM-related tools on the toolserver. Some
are updated once a month, some can be used live, but all are in high demand
by GLAM institutions.
Now, the monthly updated stats have always been slow to run, but have almost
ground to a halt recently. The on-demand tools have stalled completely.
All these tools get their data from stats.grok.se, which works well but is
not really high-speed; my on-demand tools have apparently been shut out
recently because too many people were using them, DDOSing the server :-(
I know you are working on page view numbers, and from what I gather it's
up-and-running internally already. My requirements are simple: I have a
list of pages on many Wikimedia projects; I need view counts for these
pages for a specific month, per-page.
Now, I know that there is no public API yet, but is there any way I can get
to the data, at least for the monthly stats?
Cheers,
Magnus
fyi
-------- Original Message --------
Subject: Special Tech Talk: Big Data Tools
Date: Mon, 29 Apr 2013 13:19:03 -0700
From: Quim Gil <qgil(a)wikimedia.org>
Organization: Wikimedia Foundation
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
This week we have a Tech Talk special edition:
Big Data Tools
by David Schoonover
May 1, 19:30 UTC / 12:30 PDT
Overview of the big data tools we have available at the Wikimedia
Foundation. We'll be writing real queries to explore real data from the
mobile site!
More details and timezones at
http://www.mediawiki.org/wiki/Meetings/2013-05-01
--
Quim Gil
Technical Contributor Coordinator @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil
Over the last week I've created hive tables for many of our larger datasets
in Hadoop. Those were used to generate many of the results you've seen in
the last few days.
Both the schemas for those tables and the job-scripts can be found in:
- https://github.com/wikimedia/kraken/tree/master/hive
Questions welcome.
--
David Schoonover
dsc(a)wikimedia.org
forwarded from Yuri Astrakhan:
Don't panic. I should have explained it at the meeting: the config change
at this point is PHP only - no change in varnish code just yet. Which
means, the X-CS header is set only for the zero-rated requests for now, so
analytics should continue investigating the pending questions in
EtherPad <http://etherpad.wikimedia.org/x-cs>.
BUT! We do want to make the transition to X-CS being set on ALL incoming
requests from known carriers once everything else checks out, so please let
us know when we can start doing that. The plan is:
* deploy Zero config pages and manage all the banner logic from there
* add PHP code to determine if the banner should be shown or not based on
the zero settings, not the X-CS presence.
* Get an OK from Analytics that everything is OK
* change varnish files to ALWAYS include X-CS from known carriers (btw,
fundraiser team really wants this data - they think our IP ranges are
better than GeoIP)
* Finally, get OPs to switch over from ACL-based to custom lookup based IP
-> X-CS mapping.
See also:
* Original specification:
RFC <http://www.mediawiki.org/wiki/Requests_for_comment/Zero_Architecture>.
* RT Ticket <https://rt.wikimedia.org/Ticket/Display.html?id=4881> is
tracking OPs progress on implementing the IP to X-CS mapping function
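
For illustration only, the IP to X-CS lookup mentioned above might look roughly like the sketch below. It is not the Varnish or Ops implementation: the carrier codes and IP ranges are made up (the ranges are reserved documentation networks), and the function name x_cs_for is hypothetical.

```python
import ipaddress

# Hypothetical carrier codes mapped to made-up ranges (documentation networks).
CARRIER_RANGES = {
    "carrier-a": ["192.0.2.0/24"],
    "carrier-b": ["198.51.100.0/24", "2001:db8::/32"],
}

# Parse the CIDR strings once so each lookup only does membership tests.
NETWORKS = [
    (ipaddress.ip_network(cidr), code)
    for code, cidrs in CARRIER_RANGES.items()
    for cidr in cidrs
]


def x_cs_for(ip):
    """Return the carrier code for a client IP, or None for unknown carriers."""
    addr = ipaddress.ip_address(ip)
    for network, code in NETWORKS:
        # Only compare addresses and networks of the same IP version.
        if addr.version == network.version and addr in network:
            return code
    return None
```

A request from 192.0.2.17 would get the header value "carrier-a" here, while an address outside every listed range gets no X-CS value at all. A linear scan is fine for a handful of ranges; a production lookup would presumably want something faster.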
Hiya all,
As promised earlier today in the Analytics weekly showcase, I've got a few
interesting bits of data to share from playing with the new Mobile Site
Sessions dataset.
# Visits to Mobile Site, 4/21/2013
- Total Visits: 51,624,103
- Unique Visitors: 37,736,120
- Total Pageviews: 104,972,033
- Avg Pageviews per Session: 2.0334
- Max Pageviews in one Session: 141,882
## Standard Site
- Visits: 51,603,221
- Unique Visitors: 37,723,188
- Pageviews: 104,910,382
- Avg Pageviews per Session: 2.033
## Alpha Site
- Visits: 986
- Unique Visitors: 822
- Pageviews: 7,087
- Avg Pageviews per Session: 7.188
## Beta Site
- Visits: 19,896
- Unique Visitors: 16,235
- Pageviews: 54,564
- Avg Pageviews per Session: 2.742
## Notes
- A session (or "visit") is defined as all activity with less than 30
minutes between each hit. Intuitively speaking, a session ends when the
user hasn't done anything in 30m.
- As we do not set visitor_id cookies for all users, the "unique visitors"
metric was calculated using hash(ip_address + users_agent) as visitor_id.
- This job looked at all requests to the mobile site on 4/21/2013, which is
75.17 GB of request logs.
- The job took ~17 minutes to process the day into 15.3 GB of sessions.
- The summary above took maybe 10 minutes to set up/write in Hive, and the
job took maybe 7 minutes.
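
As a rough illustration of the session and visitor_id definitions in the notes above, here is a small Python sketch. It is not the actual Hive/Kraken job: the function names, the input shape (ip, user agent, timestamp) and the choice of SHA-1 as the hash are all assumptions.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta

# A session continues while consecutive hits are less than 30 minutes apart.
SESSION_GAP = timedelta(minutes=30)


def visitor_id(ip_address, user_agent):
    """Stand-in visitor id: hash of IP + user agent (SHA-1 is an assumption)."""
    return hashlib.sha1((ip_address + user_agent).encode("utf-8")).hexdigest()


def sessionize(hits):
    """Group (ip, user_agent, timestamp) hits into sessions per visitor.

    Returns {visitor_id: [[timestamp, ...], ...]}, one inner list per session.
    """
    by_visitor = defaultdict(list)
    for ip, user_agent, timestamp in hits:
        by_visitor[visitor_id(ip, user_agent)].append(timestamp)

    sessions = {}
    for vid, stamps in by_visitor.items():
        stamps.sort()
        visitor_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - visitor_sessions[-1][-1] < SESSION_GAP:
                visitor_sessions[-1].append(ts)  # same session
            else:
                visitor_sessions.append([ts])    # gap too long: new session
        sessions[vid] = visitor_sessions
    return sessions
```

With this grouping, "visits" is the total number of inner lists, "unique visitors" is the number of keys, and pageviews per session is the length of each inner list.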
In addition to that summary, I ran a few jobs on the entry_referer field --
the URL that referred the user to us when the session started. Obvious
caveats: this is only one day of data, and it's only the mobile site. Draw
conclusions with care.
First, I pulled out the top referring domains. It's mostly as you'd expect
-- search engines -- though you'll also note that several Wikipedia mobile
sites show up. My working hypothesis is that people don't tend to close
tabs on smartphones; when they later come back, it is often to an open
Wikipedia tab: clicking a link or performing a search means the referrer is
still us.
Since -- as expected -- so much of the data pertained to search engines, I
also calculated the top search queries and top keywords that sent people to
us. (For keywords, I've filtered out common "stop words": de, of, in, is,
la, and, el, es, to, en, di, los, le, da, se, las, les, il, du, a, i, o, y,
e.) In both, you see the predictable: lots of searches for porn, for
"facebook", for "wiki", etc. But you also see a few things that surprised
me:
- Tons of Japanese. Japan is the most mobile-enabled country in the world
so I guess we should have expected to see many searches in Japanese show up
in the top queries. I've left them URL-encoded in the results -- you'll see
them as weird lines with % in them.
- Apparently people search for movies and TV so they can spoil their fun by
reading about them on Wikipedia. Both "movies" and "film" show up in the
top keywords; Iron Man 1, 2, AND 3 all show up in the top search queries. I
didn't expect this was a major use-case, but -- wikigroaning aside -- it's
an interesting fact.
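
For anyone who wants to poke at the referrer data themselves, the keyword counting described above can be sketched in Python roughly as follows. This is not the job that produced the results: the function names are made up, and reading the query from a "q" URL parameter is an assumption that only holds for some search engines. Note that parse_qs also decodes the %-encoded queries.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

# The stop words listed above.
STOP_WORDS = {
    "de", "of", "in", "is", "la", "and", "el", "es", "to", "en", "di", "los",
    "le", "da", "se", "las", "les", "il", "du", "a", "i", "o", "y", "e",
}


def search_query(referer, param="q"):
    """Return the decoded, lower-cased search query of a referrer URL, or None."""
    values = parse_qs(urlparse(referer).query).get(param)
    return values[0].strip().lower() if values else None


def top_keywords(referers, n=10):
    """Count whitespace-separated query keywords, skipping stop words."""
    counts = Counter()
    for referer in referers:
        query = search_query(referer)
        if query:
            counts.update(w for w in query.split() if w not in STOP_WORDS)
    return counts.most_common(n)
```

Splitting on whitespace is a crude tokenizer; it will treat an unspaced Japanese query as a single keyword.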
I'm sure we're only scratching the surface here. This is an exciting
dataset, and I'm sure there's lots more to learn!
The full results:
- Top Referring Entry Domains:
http://stats.wikimedia.org/kraken-public/webrequest/mobile/views/sessions/m…
- Top Referring Entry Search Queries:
http://stats.wikimedia.org/kraken-public/webrequest/mobile/views/sessions/m…
- Top Referring Entry Search Keywords:
http://stats.wikimedia.org/kraken-public/webrequest/mobile/views/sessions/m…
Questions are welcome!
--
David Schoonover
dsc(a)wikimedia.org
Hi!
Wednesday's sprint demo concluded the third sprint of the "Self-Serve
Observational Analytics" Release. The goal of this release is to schedule
features that will empower end-users to interact independently with the
Analytics toolset.
Apologies for cross-posting; ideally you should receive this on the
Analytics Mailinglist so we can have one focal point for conversation. If
you are not on the Analytics list then please subscribe at
https://lists.wikimedia.org/mailman/listinfo/analytics
## Defects & Features completed (Ready for Showcase/Shipping/Done) during
Sprint ending 2013-04-24 ##
#92 F - Page View Metrics report for Official Wikipedia Mobile Apps (5)
Ready for Showcase requested by Mobile (Tomasz & Brion)
#240 F - Session Analysis of Mobile Site Visits by Mode
(Alpha/Beta/Standard) (8) Ready for Showcase requested by Mobile (Maryana)
#518 I - Setup SSL for User Metrics (3) Ready for Showcase requested by
Analytics & E3
#614 F - Historical Wikipedia Zero Provider / Country Counts (N/E) Ready
for Showcase requested by Wikipedia Zero (Amit)
#60 F - Mobile pageview requests reporting in wikistats (N/E) Done
requested by Mobile (Tomasz)
#579 D - Migrate the dandreescu user to milimetric on the production
cluster (N/E) Done requested by Analytics
#595 D - Kraken - ClassCastException in CDH 4.2 (N/E) Done requested by
Analytics
## Planned for Showcase on 2013-05-01 ##
#388 F - Admin defines new static cohort by uploading CSV (5)
#570 I - Local dev env for User Metrics (8)
## Current Sprint (ending 2013-04-24) ##
Stories in progress from last sprint:
#148 I - Network ACL (N/E) BLOCKED requested by Ops/Mark
#131 I - Puppetize Kafka 0.7 (8) Coding requested by Analytics & Ops
#244 F - Track user adoption of Wikipedia Zero (N/E) Testing requested by
Wikipedia Zero/Amit
New stories
#134 I - Puppetize Hadoop CDH4 (13)
#388 F - Admin defines new static cohort by uploading CSV (5)
#570 I - Local dev env for User Metrics (8)
#353 S - Wikistats - mobile country report et al. (N/E)
(Number in parentheses) = estimate of complexity
N/E = not estimated;
F = Feature
D = Defect
I = Infrastructure Task
S = Spike
Any mingle card can be accessed using the base url
https://mingle.corp.wikimedia.org/projects/analytics/cards/XYZ where XYZ is
the Mingle card id.
If you have any questions, comments or feedback: please let us know!
Best,
Diederik
Now that ClickTracking is retired, we don't need Extension:UserDailyContribs.
wmf-config/InitialiseSettings.php says
'wmgUserDailyContribs' => array(
// Actively used by researchers and analysts.
// Contact person: Dario Taraborelli <dtaraborelli(a)wikimedia.org>
'default' => true,
),
Is that true? Is anyone querying the user_daily_contribs table and its
counts of contribs by user_id by day?
--
=S Page software engineer on E3
Howdy all,
Now that the stars have aligned and:
- The Mobile Frontend release with X-Analytics logging the site mode cookie
(mf-m) has been out for a bit, and shows up in our logs;
- We've deployed the patch to fix a critical Hadoop bug[1][2] that was
blocking the job
- I've personally "conquered" the wikiplague again (after being out half
last week)
- And finally, the script works and its results look valid and complete
...I'm happy to report the Mobile site sessions job[3] is ready to ship.
I'm pretty sure this is the first view of mobile site sessions ever, so I
was pretty excited. I've included some Bonus Stats from my test run which I
generated 'cause I was curious :)
Before we get too excited, though, I've held off on enabling the daily job
(as well as backfilling March) because it turns out that a day's worth of
data generates about 16GB worth of sessions. This isn't a problem for the
cluster, but we'd pretty rapidly compromise stat1001's public data storage
with daily syncs. So to go forward, access to the data would probably have
to be provided via private rsync. A third option is to work with the data
on the cluster itself via any of the available tools; I've been using a SQL
tool called Hive to validate various job runs and I can't say I'm missing
MySQL. (If people are interested, I'd be happy to go over the options in
more detail.)
So, we're looking for guidance on going forward.
- Is the granular session output still the desired result, given the job's
size? Currently the job ends by coalescing the data into one giant TSV;
instead it could generate a summary, or a selection of stats about the run.
- If so:
- Is it helpful to backfill March?
- Does the data need to be publicly accessible via HTTP, or can we
explore other options for providing access to the team?
I'm happy to answer any other questions as well.
Thanks!
Dave for Team Analytics
[1] The bug: https://issues.cloudera.org/browse/DISTRO-461
[2] The fix: https://mingle.corp.wikimedia.org/projects/analytics/cards/595
[3] Feature request:
https://mingle.corp.wikimedia.org/projects/analytics/cards/240
---
BONUS STATS!
Notes:
- As a reminder, a session (or "visit") is defined as all activity with
less than 30 minutes between each hit.
- The test job looked at all requests on 4/21, which is 75.17 GB of request
logs.
- It took ~17 minutes to process the day into 15.3 GB of sessions. (It then
took 51m44s to concatenate those 28 files into one monstrous TSV for "ease"
of delivery to y'all.)
- The summary below took maybe 10 minutes to set up/write in Hive, and the
job took maybe 7 minutes.
Visits to Mobile Site, 4/21/2013
- Total Visits: 51,624,103
- Unique Visitors: 37,736,120
- Total Pageviews: 104,972,033
- Avg Pageviews per Session: 2.0334
- Max Pageviews in one Session: 141,882
Standard Site
- Visits: 51,603,221
- Unique Visitors: 37,723,188
- Pageviews: 104,910,382
- Avg Pageviews per Session: 2.033
Alpha Site
- Visits: 986
- Unique Visitors: 822
- Pageviews: 7,087
- Avg Pageviews per Session: 7.188
Beta Site
- Visits: 19,896
- Unique Visitors: 16,235
- Pageviews: 54,564
- Avg Pageviews per Session: 2.742
Those numbers look sane to you guys?
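
One quick way to sanity-check the figures above is to recompute the averages and compare the per-mode numbers against the totals. The sketch below only uses the numbers already listed; the variable names are arbitrary.

```python
# Figures copied from the summary above.
sites = {
    "standard": {"visits": 51603221, "visitors": 37723188, "pageviews": 104910382},
    "alpha": {"visits": 986, "visitors": 822, "pageviews": 7087},
    "beta": {"visits": 19896, "visitors": 16235, "pageviews": 54564},
}
total = {"visits": 51624103, "visitors": 37736120, "pageviews": 104972033}


def avg_pageviews(stats):
    """Average pageviews per session for one row of the summary."""
    return stats["pageviews"] / stats["visits"]


# Visits and pageviews should add up exactly across the three modes.
visits_match = sum(s["visits"] for s in sites.values()) == total["visits"]
pageviews_match = sum(s["pageviews"] for s in sites.values()) == total["pageviews"]

# Per-mode unique visitors can exceed the total if someone used several modes.
visitor_overlap = sum(s["visitors"] for s in sites.values()) - total["visitors"]

for name, stats in sites.items():
    print(name, round(avg_pageviews(stats), 3))
print("total", round(avg_pageviews(total), 4))
print(visits_match, pageviews_match, visitor_overlap)
```

Visits and pageviews add up exactly, and the recomputed averages match the reported ones. The per-mode unique visitors sum to 4,125 more than the total, which would be consistent with some visitors showing up in more than one mode during the day, though that is only a guess.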
--
David Schoonover
dsc(a)wikimedia.org
"Datavisualization.ch Selected Tools is a collection of tools that we, the people behind Datavisualization.ch, work with on a daily basis and recommend warmly. This is not a list of everything out there, but instead a thoughtfully curated selection of our favourite tools that will make your life easier creating meaningful and beautiful data visualizations."
I presume many of these are familiar to regulars of this list, but perhaps not all -- many were new to me.
http://selection.datavisualization.ch/
--
Ori Livneh