Here's the summary of Wednesday's Analytics sprint planning, demo, and retrospective.
# TL;DR #
Our most recent sprint continued our focus on improving visibility into
mobile initiatives, including the mobile site, support for mobile
applications, and Wikipedia Zero. In addition, we worked on solving the
packet loss issues on Locke and Oxygen.
## Defects & Features taken during Sprint ending 2013-03-20 ##
Bugzilla:#45178/Mingle:#129 (D) - Space characters in [pagecounts-raw]
titles / Well-formed output of Webstatscollector. DONE (showcased on
2013-03-20)
Mingle:#319 (F) - Deploy metrics-api. DONE (showcased on 2013-03-20)
Mingle:#154 (F) - Provide unsampled blog webtraffic as datastream. SHIPPING
(showcased on 2013-03-20)
Mingle:#117 (F) - Visualize mobile page views by country and device. DONE
(showcased on 2013-03-20)
## Planned for Showcase on 2013-03-27 ##
Mingle:#61 (F) - Mobile Site Pageviews by Device Class
Mingle:#68 (F) - Visualize Commons Mobile App (Android & iOS) metrics in
Mingle:#244 (F) - Track user adoption of Wikipedia Zero
Mingle:#60 (F) - Mobile pageview requests reporting in wikistats
## Current Sprint (ending 2013-03-27) ##
The current sprint's theme is still Mobile.
Stories in progress from last sprint:
Mingle:#61 F - Mobile site pageviews by device class
Mingle:#78 F - Document pageview business logic for analysts
Mingle:#244 F - Track user adoption of Wikipedia Zero
Stories started but blocked:
Mingle:#60 D - Mobile pageview requests reporting in Wikistats
Mingle:#52 I - Puppetize Limn (N/E)
Mingle:#92 F - Page View Metrics Report for Official Wikipedia Mobile Apps
Mingle:#148 I - Network ACL (N/E)
Mingle:#155 F - Server maintenance to reduce packet loss (8)
Mingle:#240 F - Session Analysis of mobile site visits by mode
Mingle:#272 F - Dump stats: tally wikis by activity level (# active users)
(Number in parentheses) = estimate of complexity
N/E = not estimated; 148 will be done by Ops.
F = Feature
D = Defect
I = Infrastructure Task
Apologies for crossposting
The Analytics Team is planning to deploy "tab as field delimiter" to
replace the current space field delimiter on the varnish/squid/nginx
servers. We would like to do this on February 1st. The reason for this
change is that we need a consistent number of fields in each
webrequest log line. Right now, some fields contain spaces, which requires
a lot of post-processing cleanup and slows down the generation of reports.
What is affected and maintained by Analytics:
* udp-filter: already has support for the tab character.
* webstatscollector: we compiled a new version of filter to add support for
the tab character.
* wikistats: we will fix the scripts on an ongoing basis.
* udp2log: we have a patch ready for inserting sequence numbers separated
by a tab character.
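To illustrate the field-count problem, here is a toy sketch (the field layout and values are invented for the example, not the real webrequest format):

```python
# A value containing a space breaks naive space-splitting: the number of
# fields depends on the page title, so downstream parsers need cleanup.
space_line = "cp1001 1234 2013-02-01T00:00:00 200 Main Page"
tab_line = "cp1001\t1234\t2013-02-01T00:00:00\t200\tMain Page"

print(len(space_line.split(" ")))   # 6 -- "Main Page" splits into two fields
print(len(tab_line.split("\t")))    # 5 -- one field per column, as intended
```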
In particular, I would like feedback on three questions:
1) Are there important reasons not to use tab as field delimiter?
2) Are there important pieces of logging that expect a space instead of a
tab, that need to be fixed, and that I did not mention in this email?
3) Is February 1st a good date to deploy this change? (Assuming that all
preps are finished)
cross-posting to Analytics.
(this is great, Mariya!)
On Thu, Mar 21, 2013 at 5:47 AM, Maria Miteva <mariya.miteva(a)gmail.com> wrote:
> Hi everyone,
> As part of my internship with WMF I have created
> http://meta.wikimedia.org/wiki/Research:Data - a single-page introduction
> to Wikimedia-related data sources. It's intended to inform researchers about
> the variety of Wikimedia data available.
> The page can definitely benefit from some review from the actual users of
> data. Please take a look and feel free to add or correct information. The
> "How-to" and "Existing tools" subsections definitly could be expanded.
> Also, http://meta.wikimedia.org/wiki/Research:Data/FAQ is still quite
> small and can definitely be improved.
> My internship is over in a few days. I will still be around for a while, but if
> you see anything that can be improved please take the initiative and change
> it. If you don't feel comfortable doing it, write a note on the Talk page.
> Finally, I would like to encourage you to share any Wikimedia-related
> datasets you have or you know of, small or big, on
> http://datahub.io/group/wikimedia. The aim is to eventually have all
> Wikimedia-related data documented on the Wikimedia group on DataHub.
Learning & Evaluation *
Imagine a world in which every single human being can freely share in
the sum of all knowledge. Help us make it a reality!
Donate to Wikimedia <https://donate.wikimedia.org/>
I'm trying to find some page view stats for Wikipedia articles broken down
by subject matter (like 'history', or 'science'). So I would like to be
able to find out what fraction of total Wikipedia page views are for
articles belonging to a particular category like 'History' or 'The Arts'.
If it helps, an example of the kind of information I was hoping for can be
found in this video of Jimmy Wales
https://www.youtube.com/watch?v=IhumTKbmdFs (chart is shown at time 12:40).
I would be really grateful for any suggestions on getting this kind of data.
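Here's roughly the shape of the computation I have in mind; a toy sketch in which both inputs (per-article view counts and an article-to-category mapping) are hypothetical placeholders:

```python
# Toy example: what fraction of total page views falls in each category?
# Real inputs would come from the pagecounts dumps and category data.
views = {"World War II": 120000, "Photosynthesis": 45000, "Mona Lisa": 80000}
categories = {
    "World War II": "History",
    "Photosynthesis": "Science",
    "Mona Lisa": "The Arts",
}

total = sum(views.values())
by_category = {}
for article, count in views.items():
    label = categories.get(article, "Uncategorized")
    by_category[label] = by_category.get(label, 0) + count

for label, count in sorted(by_category.items()):
    print(f"{label}: {count / total:.1%} of page views")
```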
"This case study examines the use of Wikipedia by the Ball State
University Libraries as an opportunity to raise the visibility of
digitized historic sheet music assets made available in the university's
Digital Media Repository. By adding links to specific items in this
collection to relevant, existing Wikipedia articles, Ball State
successfully and efficiently expanded the user base of this collection
in the Digital Media Repository by vastly enhancing the discoverability
of the collection's assets...
"The results of this study show that the addition of links from relevant
Wikipedia articles to individual digitized assets in the Hague Sheet
Music Collection in the Ball State University Digital Media Repository
was an overwhelming success. Despite the fact that only 57 links to 40
assets were added to Wikipedia articles, pageviews for the collection of
149 assets roughly tripled as a result of this effort. The adding of
links at the item level provided a plethora of highly-visible entry
points to this collection's assets, raising awareness of the existence
of these resources to interested Internet users who were previously
unaware of these materials, as is suggested by the collection's use
statistics. The success of this initiative is also remarkable in its
efficiency, generating a large number of new digital patrons while
requiring relatively little time to plan and execute."
Includes an encouraging graph. :-)
Any chance of getting Wikidata in here?
Work in progress:
Hashar and Ori and I (and probably others) need to do this more often these days. I've had relative success doing this so far for python-flask-login and python-jsonschema, so I thought I'd write it up for future use.
I betcha I'll end up changing the part about using stdeb to create the debian/ dir soon (*avoids Faidon's scowl*), but at least the bit about how to prep your repository and to use git-import-orig is good.
(Or is it?)
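For reference, the stdeb step I mean looks roughly like this; a minimal sketch, and the `debianize` command name is an assumption that may vary across stdeb versions:

```python
# Hypothetical sketch: generate a debian/ directory from setup.py metadata
# with stdeb. Run from the package's source tree; "debianize" is an
# assumption and may differ by stdeb version.
import subprocess

subprocess.check_call(
    ["python", "setup.py", "--command-packages=stdeb.command", "debianize"]
)
```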
Would someone who feels powerful enough to gimme a second thumbs up review this, pretty please?
It has been tested in labs on a couple of instances. We're about to move reportcard.wmflabs.org to the instance that is using this module. Jeff has +1ed it, but said that I should get someone else to ok this as well.
Faidon, are you the only one that can +2 new modules like this?
The Analytics team had the Kraken Arch Review with Mark and Faidon a couple of weeks ago. I wanted to summarize a few things here so that everyone is aware of the status of the analytics nodes.
We defined 3 phases that the cluster has to go through before it is considered production cool.
1. Minimally Viable Cluster
This is what we have now, described at http://www.mediawiki.org/wiki/Analytics/Kraken/Overview. analytics1001 has been reinstalled, but the other machines are still running unpuppetized Kraken stuff. The Analytics team has deliverables for this month. Reinstalling all of these nodes and repuppetizing (with review) before then would slow us down too much. analytics1010 (the Hadoop NameNode) access has been restricted with iptables, and Mark plans to set up network ACLs to restrict access from analytics nodes to the rest of the cluster soon (See: https://rt.wikimedia.org/Ticket/Display.html?id=4433 ).
2. Initial Base Cluster
This is basically what we have now, but fully reviewed and puppetized. This is a transitional phase. This will not include Storm for ETL, and probably won't include using Kafka from the frontends. All analytics machines will be reinstalled before we consider this phase complete. We hope to get here in the next couple of months.
3. Production Cluster
This is the ideal setup, including Storm, frontend Kafka Producers, Avro serialized data, fully automated Oozie jobs, etc. etc.
In the meantime, there may be weird non-puppetized things on the remaining analytics nodes (I'm referring to Leslie's email about the extra apt sources). If you notice anything like this, please don't hesitate to ask me; it's probably something that isn't being used anyway.
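As a quick spot check of the iptables restriction mentioned above, something like this could be run from a non-analytics host; a sketch only, where the hostname and port (8020, Hadoop's default NameNode IPC port) are assumptions that may not match Kraken's setup:

```python
# Hypothetical reachability check for the restricted NameNode. From a
# non-analytics host, a failed connection is the expected (correct) result.
import socket

def can_connect(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hostname is an assumption for illustration.
print(can_connect("analytics1010.eqiad.wmnet", 8020))
```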