Here's the summary of Wednesday's Analytics sprint planning, demo, and retrospective.
# TL;DR #
Our most recent sprint continued our focus on improving visibility into
mobile initiatives, including the mobile site, support for mobile
applications, and Wikipedia Zero. In addition, we worked on solving the
packet loss issues on Locke and Oxygen.
## Defects & Features taken during Sprint ending 2013-03-20 ##
Bugzilla:#45178/Mingle:#129 (D) - Space characters in [pagecounts-raw]
titles / Well-formed output of Webstatscollector. DONE (showcased on
2013-03-20)
Mingle:#319 (F) - Deploy metrics-api. DONE (showcased on 2013-03-20)
Mingle:#154 (F) - Provide unsampled blog webtraffic as datastream. SHIPPING
(showcased on 2013-03-20)
Mingle:#117 (F) - Visualize mobile page views by country and device. DONE
(showcased on 2013-03-20)
## Planned for Showcase on 2013-03-27 ##
Mingle:#61 (F) - Mobile Site Pageviews by Device Class
Mingle:#68 (F) - Visualize Commons Mobile App (Android & iOS) metrics in
Mingle:#244 (F) - Track user adoption of Wikipedia Zero
Mingle:#60 (F) - Mobile pageview requests reporting in wikistats
## Current Sprint (ending 2013-03-27) ##
The current sprint's theme is still Mobile.
Stories in progress from last sprint:
Mingle:#61 F - Mobile site pageviews by device class
Mingle:#78 F - Document pageview business logic for analysts
Mingle:#244 F - Track user adoption of Wikipedia Zero
Stories started but blocked:
Mingle:#60 D - Mobile pageview requests reporting in Wikistats
Mingle:#52 I - Puppetize Limn (N/E)
Mingle:#92 F - Page View Metrics Report for Official Wikipedia Mobile Apps
Mingle:#148 I - Network ACL (N/E)
Mingle:#155 F - Server maintenance to reduce packet loss (8)
Mingle:#240 F - Session Analysis of mobile site visits by mode
Mingle:#272 F - Dump stats: tally wikis by activity level (# active users)
(Number in parentheses) = estimate of complexity
N/E = not estimated; 148 will be done by Ops.
F = Feature
D = Defect
I = Infrastructure Task
Apologies for crossposting
The Analytics Team is planning to deploy "tab as field delimiter" to
replace the current space field delimiter on the varnish/squid/nginx
servers. We would like to do this on February 1st. The reason for this
change is that we need a consistent number of fields in each
webrequest log line. Right now, some fields contain spaces, which requires
a lot of post-processing cleanup and slows down the generation of reports.
What is affected and maintained by Analytics:
* udp-filter: already has support for the tab character.
* webstatscollector: we compiled a new version of filter to add support for
the tab character.
* wikistats: we will fix the scripts on an ongoing basis.
* udp2log: we have a patch ready for inserting sequence numbers separated
by a tab character.
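To illustrate the field-count problem, here is a toy sketch (the field layout and values are invented for the example, not the real webrequest format):

```python
# A value containing a space breaks naive space-splitting: the number of
# fields depends on the page title, so downstream parsers need cleanup.
space_line = "cp1001 1234 2013-02-01T00:00:00 200 Main Page"
tab_line = "cp1001\t1234\t2013-02-01T00:00:00\t200\tMain Page"

print(len(space_line.split(" ")))   # 6 -- "Main Page" splits into two fields
print(len(tab_line.split("\t")))    # 5 -- one field per column, as intended
```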
In particular, I would like feedback on three questions:
1) Are there important reasons not to use tab as field delimiter?
2) Are there important pieces of logging that expect a space instead of a
tab, that need to be fixed, and that I did not mention in this email?
3) Is February 1st a good date to deploy this change? (Assuming that all
preps are finished)
cross-posting to Analytics.
(this is great, Mariya!)
On Thu, Mar 21, 2013 at 5:47 AM, Maria Miteva <mariya.miteva(a)gmail.com> wrote:
> Hi everyone,
> As part of my internship with WMF I have created
> http://meta.wikimedia.org/wiki/Research:Data - a single-page introduction
> to Wikimedia-related data sources. It's intended to inform researchers about
> the variety of Wikimedia data available.
> The page can definitely benefit from some review from the actual users of
> data. Please take a look and feel free to add or correct information. The
> "How-to" and "Existing tools" subsections definitly could be expanded.
> Also, http://meta.wikimedia.org/wiki/Research:Data/FAQ is still quite
> small and can definitely be improved.
> My internship is over in a few days. I will still be around for a while, but if
> you see anything that can be improved please take the initiative and change
> it. If you don't feel comfortable doing it, write a note on the Talk page.
> Finally, I would like to encourage you to share any Wikimedia-related
> datasets you have or you know of, small or big, on
> http://datahub.io/group/wikimedia. The aim is to eventually have all
> Wikimedia-related data documented on the Wikimedia group on DataHub.
Learning & Evaluation *
Imagine a world in which every single human being can freely share in
the sum of all knowledge. Help us make it a reality!
Donate to Wikimedia <https://donate.wikimedia.org/>
I'm trying to find some page view stats for Wikipedia articles broken down
by subject matter (like 'history', or 'science'). So I would like to be
able to find out what fraction of total Wikipedia page views are for
articles belonging to a particular category like 'History' or 'The Arts'.
If it helps, an example of the kind of information I was hoping for can be
found in this video of Jimmy Wales
https://www.youtube.com/watch?v=IhumTKbmdFs (chart is shown at time 12:40).
I would be really grateful for any suggestions on getting this kind of data.
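Here's roughly the shape of the computation I have in mind; a toy sketch in which both inputs (per-article view counts and an article-to-category mapping) are hypothetical placeholders:

```python
# Toy example: what fraction of total page views falls in each category?
# Real inputs would come from the pagecounts dumps and category data.
views = {"World War II": 120000, "Photosynthesis": 45000, "Mona Lisa": 80000}
categories = {
    "World War II": "History",
    "Photosynthesis": "Science",
    "Mona Lisa": "The Arts",
}

total = sum(views.values())
by_category = {}
for article, count in views.items():
    label = categories.get(article, "Uncategorized")
    by_category[label] = by_category.get(label, 0) + count

for label, count in sorted(by_category.items()):
    print(f"{label}: {count / total:.1%} of page views")
```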
"This case study examines the use of Wikipedia by the Ball State
University Libraries as an opportunity to raise the visibility of
digitized historic sheet music assets made available in the university's
Digital Media Repository. By adding links to specific items in this
collection to relevant, existing Wikipedia articles, Ball State
successfully and efficiently expanded the user base of this collection
in the Digital Media Repository by vastly enhancing the discoverability
of the collection's assets...
"The results of this study show that the addition of links from relevant
Wikipedia articles to individual digitized assets in the Hague Sheet
Music Collection in the Ball State University Digital Media Repository
was an overwhelming success. Despite the fact that only 57 links to 40
assets were added to Wikipedia articles, pageviews for the collection of
149 assets roughly tripled as a result of this effort. The adding of
links at the item level provided a plethora of highly-visible entry
points to this collection's assets, raising awareness of the existence
of these resources to interested Internet users who were previously
unaware of these materials, as is suggested by the collection's use
statistics. The success of this initiative is also remarkable in its
efficiency, generating a large number of new digital patrons while
requiring relatively little time to plan and execute."
Includes an encouraging graph. :-)
Any chance of getting Wikidata in here?
Work in progress:
Hashar and Ori and I (and probably others) need to do this more often these days. I've had relative success doing this so far for python-flask-login and python-jsonschema, so I thought I'd write it up for future use.
I betcha I'll end up changing the part about using stdeb to create the debian/ dir soon (*avoids Faidon's scowl*), but at least the bit about how to prep your repository and to use git-import-orig is good.
(Or is it?)
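For reference, the stdeb step I mean looks roughly like this; a minimal sketch, and the `debianize` command name is an assumption that may vary across stdeb versions:

```python
# Hypothetical sketch: generate a debian/ directory from setup.py metadata
# with stdeb. Run from the package's source tree; "debianize" is an
# assumption and may differ by stdeb version.
import subprocess

subprocess.check_call(
    ["python", "setup.py", "--command-packages=stdeb.command", "debianize"]
)
```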
Would someone who feels powerful enough to gimme a second thumbs up review this, pretty please?
It has been tested in labs on a couple of instances. We're about to move reportcard.wmflabs.org to the instance that is using this module. Jeff has +1ed it, but said that I should get someone else to ok this as well.
Faidon, are you the only one that can +2 new modules like this?
The Analytics team had the Kraken Arch Review with Mark and Faidon a couple of weeks ago. I wanted to summarize a few things here so that everyone is aware of the status of the analytics nodes.
We defined 3 phases that the cluster has to go through before it is considered production cool.
1. Minimally Viable Cluster
This is what we have now, described at http://www.mediawiki.org/wiki/Analytics/Kraken/Overview. analytics1001 has been reinstalled, but the other machines are still running unpuppetized Kraken stuff. The Analytics team has deliverables for this month. Reinstalling all of these nodes and repuppetizing (with review) before then would slow us down too much. analytics1010 (the Hadoop NameNode) access has been restricted with iptables, and Mark plans to set up network ACLs to restrict access from analytics nodes to the rest of the cluster soon (See: https://rt.wikimedia.org/Ticket/Display.html?id=4433 ).
2. Initial Base Cluster
This is basically what we have now, but fully reviewed and puppetized. This is a transitional phase. This will not include Storm for ETL, and probably won't include using Kafka from the frontends. All analytics machines will be reinstalled before we consider this phase complete. We hope to get here in the next couple of months.
3. Production Cluster
This is the ideal setup, including Storm, frontend Kafka Producers, Avro serialized data, fully automated Oozie jobs, etc. etc.
In the meantime, there may be weird non-puppetized things on the remaining analytics nodes (I'm referring to Leslie's email about the extra apt sources). If you notice anything like this, please don't hesitate to ask me; it's probably something that isn't being used anyway.
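As a quick spot check of the iptables restriction mentioned above, something like this could be run from a non-analytics host; a sketch only, where the hostname and port (8020, Hadoop's default NameNode IPC port) are assumptions that may not match Kraken's setup:

```python
# Hypothetical reachability check for the restricted NameNode. From a
# non-analytics host, a failed connection is the expected (correct) result.
import socket

def can_connect(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hostname is an assumption for illustration.
print(can_connect("analytics1010.eqiad.wmnet", 8020))
```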