I just noticed that stats.grok.se doesn't have any data beyond Saturday May
23. Wondering if Henrik or others know what the issue is (are the Wikimedia
dumps not up-to-date, or has stats.grok.se not been running the updating
scripts?)
Vipul
EventLogging suffered from performance problems and data loss from Tuesday
2015-05-05 22:00 UTC to Wednesday 2015-05-06 20:00 UTC (22 hours).
During that period, an exceptionally large number of events was sent to the EL
server for a single schema. The system could not handle them properly, which
caused data loss (30%-40% during the period) and some small gaps in the db.
All schemas were affected.
The missing data will be backfilled during this week.
Phab Task: https://phabricator.wikimedia.org/T98588
Incident documentation:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150506-EventLo…
Cheers,
Marcel
Hi,
Are there statistics about the number of people who click on red links in
Wikimedia projects?
And about what they do as the next step - go back, close the page, create
an article, something else?
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
Dan – thanks for the thorough update, hope you don’t mind if I repost this to the analytics list – I bet several people on this list are eager to know where this is going.
Dario
Begin forwarded message:
>
> From: Milimetric <no-reply(a)phabricator.wikimedia.org>
> Subject: [Maniphest] [Commented On] T44259: Make domas' pageviews data available in semi-publicly queryable database format
> Date: May 21, 2015 at 9:31:36 AM PDT
> To: dario(a)wikimedia.org
> Reply-To: T44259+public+a4a5010c21d15736(a)phabricator.wikimedia.org
>
> Milimetric added a comment.
>
> I'd love to start a more open discussion about our progress on this. Here's the recent history and where we are:
>
> February 2015: with data flowing into the Hadoop cluster, we defined which raw webrequests were "page views". The research is here <https://meta.wikimedia.org/wiki/Research:Page_view> and the code is here <https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…>
> March 2015: we used this page view definition to create a raw pageview table in Hadoop. This is queryable by Hive but it's about 3 TB per day of data. So we don't have the resources to expose it publicly
> April 2015: we used this data internally to query but it overloaded our cluster and queries were slow
> May 2015: we're working on an intermediate aggregation that would total up page counts by hour over the dimensions that we think most people care about. We estimate this will cut down size by a factor of 50
> Progress has been slow mostly because Event Logging is our main priority and it's been having serious scaling issues. We think we have a good handle on the Event Logging issues after our latest patch, and in a week or so we're going to mostly focus on the Pageview API.
>
> Once this new intermediate aggregation is done, we'll hopefully free up some cluster resources and be in a better position to load up a public API. Right now, we are evaluating two possible data pipelines:
>
> Pipeline 1:
>
> Put daily aggregates into PostgreSQL. We think per article hourly data would be too big for PostgreSQL.
> Pipeline 2:
>
> Query the Hive tables directly with Impala. Impala only handles medium to small data, but it is much faster than Hive. We might be able to query the hourly data if we use this method.
> Common Pipeline after we make the choice above:
>
> Mondrian builds OLAP cubes and handles caching which is very useful with this much data
> point RESTBase to Mondrian and expose API publicly at restbase.wikimedia.org. This will be a reliable public API that people can build tools around
> point Saiku to Mondrian and make a new public website for exploratory analytics. Saiku is an open source OLAP cube visualization and analysis tool
> Hope that helps. As we get closer to making this API real, we would love your input, participation, questions, etc.
>
>
> TASK DETAIL
> https://phabricator.wikimedia.org/T44259 <https://phabricator.wikimedia.org/T44259>
> EMAIL PREFERENCES
> https://phabricator.wikimedia.org/settings/panel/emailpreferences/ <https://phabricator.wikimedia.org/settings/panel/emailpreferences/>
> To: Milimetric
> Cc: Daniel_Mietchen, PKM, jeremyb, Arjunaraoc, Mr.Z-man, Tbayer, Elitre, scfc, Milimetric, Legoktm, drdee, Nemo_bis, Tnegrin, -jem-, DarTar, jayvdb, Aubrey, Ricordisamoa, MZMcBride, Magnus, MrBlueSky, Multichill
+analytics
On Tue, May 19, 2015 at 3:23 PM, Brian Gerstle <bgerstle(a)wikimedia.org>
wrote:
> +search
>
> On Tue, May 19, 2015 at 3:14 PM, Brian Gerstle <bgerstle(a)wikimedia.org>
> wrote:
>
>> The subject hints at a question that's been nagging me for a while, and
>> now that I'm going to be hacking on testing in Lyon I wanted to ask:
>>
>> Do we have a list of articles we usually run tests against?
>>
>> If not, do we have any processes for curating such a list? Would anyone
>> be interested in a brainstorming session at Lyon to discuss this further?
>>
>> Basically, as a developer, I would love to have more confidence that some
>> code I wrote doesn't break on our most popular articles. Or, if we can get
>> more sophisticated, that *certain properties of my code hold true for
>> certain kinds of generated pages*.*
>>
>> Please respond with your thoughts and whether you think I should create a
>> phab task for the hackathon about this. In either case, ping me anytime or
>> grab me at Lyon to discuss further!
>>
>> Regards,
>>
>> Brian
>>
>> * Yes, I'm talking about using property-based testing generators to
>> create random, shrinkable MW pages that we can run tests on. Not sure if
>> it's practical, but could be an interesting experiment.
>>
>> --
>> EN Wikipedia user page: https://en.wikipedia.org/wiki/User:Brian.gerstle
>> IRC: bgerstle
>>
>
>
>
> --
> EN Wikipedia user page: https://en.wikipedia.org/wiki/User:Brian.gerstle
> IRC: bgerstle
>
--
EN Wikipedia user page: https://en.wikipedia.org/wiki/User:Brian.gerstle
IRC: bgerstle
I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los Alamos National Laboratory recently submitted to the Wikimedia Analytics Team aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview data dumps and making them available to the public and the research community. [1]
Reid and his team spearheaded the use of the public Wikipedia pageview dumps
to monitor and forecast the spread of influenza and other diseases [2], using
language as a proxy for location. This proposal describes an aggregation
strategy that adds a geographical dimension to the existing dumps.
Feedback on the proposal is welcome on the lists or on the project talk page
on Meta [3].
Dario
[1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagev…
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_…
Hello,
I am publishing a paper reporting the impact of distributing information on
Wikipedia.
One of the values which I am reporting is the pageviews of a set of English
Wikipedia articles as measured from 2013 to 2015. I get pageviews from
stats.grok.se. As I understand, numbers there have not always included
mobile device pageviews.
What is the best estimate of the count of mobile device pageviews that can be
derived from the stats.grok.se pageview count? I think I read somewhere that,
for this range, mobile device pageviews are supposed to be anywhere from 40%
of the grok.se views to 120% of that value.
What is the most reasonable range to report for mobile device pageviews of
English Wikipedia articles from 2013-2015? Is 40-120% of the stats.grok.se
count the most reasonable range to report?
I need to report something. If there is any precedent for expressing this
somewhere then I would like to follow the precedent and cite whatever paper
described it.
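In case it helps make the question concrete, here is a minimal sketch of the
arithmetic, assuming the 40%/120% bounds above (the function name and the
example figure are illustrative, not from any published source):

```python
# Sketch: reportable range for mobile pageviews, assuming mobile traffic was
# anywhere from 40% to 120% of the desktop count shown on stats.grok.se.
# The 0.40/1.20 bounds are the heuristic quoted above, not an official figure.
def mobile_view_range(grok_views, low=0.40, high=1.20):
    """Return (lower, upper) integer estimates for mobile pageviews."""
    return (round(grok_views * low), round(grok_views * high))

# e.g. an article with 100,000 stats.grok.se views over the period
print(mobile_view_range(100_000))  # (40000, 120000)
```

So for any article, the reported range would simply be 0.4x to 1.2x of the
stats.grok.se count, with the caveat that the bounds themselves are uncited.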
Thanks,
--
Lane Rasberry
user:bluerasberry on Wikipedia
206.801.0814
lane(a)bluerasberry.com
Hi everyone,
The next research showcase will be live-streamed this Wednesday, May 13, at
11:30 PT. The streaming link will be posted on the lists a few minutes
before the showcase starts and as usual, you can join the conversation on
IRC at #wikimedia-research.
We look forward to seeing you!
Leila
This month
*The people's classifier: Towards an open model for algorithmic
infrastructure*
By Aaron Halfaker <https://www.mediawiki.org/wiki/User:Halfak_(WMF)>
Recent research has suggested that Wikipedia's algorithmic infrastructure is
perpetuating social issues. However, these same algorithmic tools are
critical to maintaining the efficiency of open projects like Wikipedia at
scale. Rather than simply critiquing algorithmic wiki-tools and calling for
less algorithmic infrastructure, I'll propose a different strategy -- an open
approach to building this algorithmic infrastructure. In this presentation,
I'll demo a set of services that are designed to open up a critical part of
Wikipedia's quality control infrastructure -- machine classifiers. I'll also
discuss how this strategy unites critical/feminist HCI with more dominant
narratives about efficiency and productivity.
*Social transparency online*
By Jennifer Marlow <http://www.aboutjmarlow.com/> and Laura Dabbish
<http://www.lauradabbish.com/>
An emerging Internet trend is greater social transparency, such as the use
of real names in social networking sites, feeds of friends' activities,
traces of others' re-use of content, and visualizations of team
interactions. There is a potential for this transparency to radically
improve coordination, particularly in open collaboration settings like
Wikipedia. In this talk, we will describe some of our research identifying
how transparency influences collaborative performance in online work
environments. First, we have been studying professional social networking
communities. Social media allows individuals in these communities to create
an interest network of people and digital artifacts, and get
moment-by-moment updates about actions by those people or changes to those
artifacts. It affords an unprecedented level of transparency about the
actions of others over time. We will describe qualitative work examining
how members of these communities use transparency to accomplish their
goals. Second, we have been looking at the impact of making workflows
transparent. In a series of field experiments we are investigating how
socially transparent interfaces, and activity trace information in
particular, influence perceptions and behavior towards others and
evaluations of their work.
I just killed 100+ 3-day-old unindexed research queries on dbstore1002.
All replication streams were lagging by nearly 1 day, and /tmp usage had
grown to hundreds of GB.
This looks similar to the problem from ~2 weeks ago, when /tmp did fill
up. The queries were of the following form (with some variation). We
need an indexing scheme for the MobileWebEditing* tables, or a new
approach.
SELECT
    Month.Date,
    COALESCE(Web.Web, 0) AS Web
-- http://stackoverflow.com/a/6871220/365238
-- ... using MariaDB 10 SEQUENCE engine instead of information_schema.columns
FROM (
    SELECT DATE_FORMAT(
        ADDDATE(CURDATE() - INTERVAL 30 - 1 DAY, @num:=@num+1),
        '%Y-%m-%d'
    ) AS Date
    FROM seq_1_to_100, (SELECT @num:=-1) num
    LIMIT 30
) AS Month
LEFT JOIN (
    SELECT
        DATE(timestamp) AS Date,
        SUM(1) AS Web
    FROM (
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount FROM MobileWebEditing_5644223
        UNION SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount FROM MobileWebEditing_6077315
        UNION SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount FROM MobileWebEditing_6637866
        UNION SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount FROM MobileWebEditing_7675117
        UNION SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount FROM MobileWebEditing_8599025
    ) AS MobileWebEditing
    WHERE
        event_action = 'error' AND
        wiki != 'testwiki'
    GROUP BY Date
) AS Web ON Month.Date = Web.Date;
EXPLAIN:
+------+--------------+--------------------------+--------+---------------+---------+---------+------------+----------+----------------------------------------------+
| id   | select_type  | table                    | type   | possible_keys | key     | key_len | ref        | rows     | Extra                                        |
+------+--------------+--------------------------+--------+---------------+---------+---------+------------+----------+----------------------------------------------+
|    1 | PRIMARY      | <derived2>               | ALL    | NULL          | NULL    | NULL    | NULL       |       30 |                                              |
|    1 | PRIMARY      | <derived4>               | ref    | key0          | key0    | 4       | Month.Date |   563154 | Using where                                  |
|    4 | DERIVED      | <derived5>               | ALL    | NULL          | NULL    | NULL    | NULL       | 56315405 | Using where; Using temporary; Using filesort |
|    5 | DERIVED      | MobileWebEditing_5644223 | ALL    | NULL          | NULL    | NULL    | NULL       |  1152600 |                                              |
|    6 | UNION        | MobileWebEditing_6077315 | ALL    | NULL          | NULL    | NULL    | NULL       |   685212 |                                              |
|    7 | UNION        | MobileWebEditing_6637866 | ALL    | NULL          | NULL    | NULL    | NULL       |  1528269 |                                              |
|    8 | UNION        | MobileWebEditing_7675117 | ALL    | NULL          | NULL    | NULL    | NULL       |  1663281 |                                              |
|    9 | UNION        | MobileWebEditing_8599025 | ALL    | NULL          | NULL    | NULL    | NULL       | 51286043 |                                              |
| NULL | UNION RESULT | <union5,6,7,8,9>         | ALL    | NULL          | NULL    | NULL    | NULL       |     NULL |                                              |
|    2 | DERIVED      | <derived3>               | system | NULL          | NULL    | NULL    | NULL       |        1 |                                              |
|    2 | DERIVED      | seq_1_to_100             | index  | NULL          | PRIMARY | 8       | NULL       |      100 | Using index                                  |
|    3 | DERIVED      | NULL                     | NULL   | NULL          | NULL    | NULL    | NULL       |     NULL | No tables used                               |
+------+--------------+--------------------------+--------+---------------+---------+---------+------------+----------+----------------------------------------------+
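One possible indexing scheme, sketched here with SQLite as a stand-in for
MariaDB (the demo table and the index name ix_action_ts are illustrative,
not the production schema): since the query filters on event_action and
groups by day of timestamp, a composite index on (event_action, timestamp)
would let the equality filter avoid full scans of the large tables.

```python
# Hedged sketch of one possible indexing scheme, using SQLite in place of
# MariaDB. The demo table mirrors the columns the query touches; the index
# name ix_action_ts is illustrative, not part of the production schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE MobileWebEditing_demo (
    timestamp TEXT, wiki TEXT, event_username TEXT,
    event_action TEXT, event_namespace INTEGER, event_userEditCount INTEGER)""")

query = ("SELECT DATE(timestamp), COUNT(*) FROM MobileWebEditing_demo "
         "WHERE event_action = 'error' AND wiki != 'testwiki' "
         "GROUP BY DATE(timestamp)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the plan detail.
    return " | ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

before = plan(query)  # full table scan to evaluate the event_action filter
con.execute("CREATE INDEX ix_action_ts ON MobileWebEditing_demo "
            "(event_action, timestamp)")
after = plan(query)   # the equality filter is now served by the index

print(before)
print(after)
```

On the real MariaDB tables the equivalent would be an ALTER TABLE ... ADD
INDEX per MobileWebEditing_* table; whether that is acceptable for the
EventLogging replicas is exactly the open question.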
---
DBA @ WMF