Yes, we have a connection limit and a bandwidth cap. Two or three simultaneous
connections should be fine.
On Fri, 19 Jun 2015 at 13:27 -0400, Ashok Rao wrote:
> Dario, thanks much.
> Ariel, Kevin – I think the 503 errors were a problem because I was
> running the grab process on parallel threads. It seems to work fine
> in serial – the only trouble with this being, of course, the time it
> takes to download all the available data.
> On Fri, Jun 19, 2015 at 12:17 PM, Kevin Leduc <kevin(a)wikimedia.org> wrote:
> > + Ariel
> > Hi Ariel, can you comment on the 503 errors happening sometimes
> > while trying to download data from the dumps?
> > On Fri, Jun 19, 2015 at 1:01 AM, Dario Taraborelli <
> > dtaraborelli(a)wikimedia.org> wrote:
> > > Forwarding a note from Ashok Rao (cc’ed), can anyone comment on
> > > the dumps server returning 503s?
> > >
> > > Ashok – we don’t yet have an in-house API to retrieve pageview
> > > data, but the Analytics team is working on one: see this thread:
> > > https://phabricator.wikimedia.org/T44259#1341010
> > > Depending on what you’re doing, http://stats.grok.se/ may also
> > > come in handy.
> > >
> > > Best,
> > > Dario
> > >
> > > > Begin forwarded message:
> > > >
> > > > From: Ashok Rao <raoashok(a)seas.upenn.edu>
> > > > Subject: Wikipedia Page views access
> > > > Date: June 18, 2015 at 5:53:12 PM GMT+2
> > > > To: dario(a)wikimedia.org
> > > >
> > > > Hi Dario,
> > > >
> > > > Good morning. I'm a student at the University of Pennsylvania
> > > > and I've been trying to perform a few analyses based on
> > > > Wikipedia page views data. I've written a script that grabs
> > > > data from the main dump site –
> > > > https://dumps.wikimedia.org/other/pagecounts-raw/ – but have run
> > > > into many sporadic 503 errors (sometimes with the download
> > > > link, other times with the main page itself). I noticed some of
> > > > this data might be available directly on Wikimedia servers that
> > > > can be utilized for research purposes.
> > > >
> > > > I was hoping I could get access to this and would appreciate your
> > > > help.
> > > >
> > > > Best,
> > > > Ashok
> > > >
> > > > --
> > > > Ashok M. Rao
> > > > The Rajendra and Neera Singh Program in Market and Social
> > > > Systems Engineering
> > > > School of Engineering and Applied Sciences
> > > > University of Pennsylvania | Class of '17
> > >
My username is rbaasland and I would like to contribute to the analytics
project. Could I have access to the project, and how should I go about
contributing?
Thank you very much,
*This discussion is intended to be a branch of the thread: "[Analytics]
Pageview API Status update".*
We in the Analytics team are trying to *choose a storage technology to keep
the pageview data* for analysis.
We don't want to get to a final system that covers all our needs yet (there
are still things to discuss), but to have something *that implements the
current stats.grok.se functionalities* as a first step. This way we can get
a better grasp of what our difficulties and limitations will be regarding
performance and privacy.
The objective of this thread is to *choose 3 storage technologies*. We will
later set up and fill each of them with 1 day of test data, evaluate them
and decide which one we will go for.
There are 2 blocks of data to be stored:
1. *Cube that represents the number of pageviews broken down by the
following dimensions*:
- day/hour (size: 24)
- project (size: 800)
- agent type (size: 2)
To test with an initial level of anonymity, all cube cells whose value is
less than k=100 have an undefined value. However, to be able to retrieve
aggregated values without losing those undefined counts, all combinations
of slices and dices are precomputed before anonymization and belong to the
cube, too. Like this:
dim1, dim2, dim3, ..., dimN, val
a, null, null, ..., null, 15 // pv for dim1=a
a, x, null, ..., null, 34 // pv for dim1=a & dim2=x
a, x, 1, ..., null, 27 // pv for dim1=a & dim2=x & dim3=1
a, x, 1, ..., true, undef // pv for dim1=a & dim2=x & ... & dimN=true
So the size of this dataset would be something between 100M and 200M
records per year, I think.
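In case it helps to make that concrete, here is a rough sketch of the idea
(just an illustration under my own assumptions, with made-up field names,
not a decided implementation): every combination of kept/nulled dimensions
gets its own aggregate row, and only after that aggregation do we blank out
cells below k=100.

type Cell = { dims: (string | null)[]; val: number | 'undef' };

// rows: raw per-hour records with all N dimensions filled in.
function buildCube(rows: { dims: string[]; views: number }[], k = 100): Cell[] {
  const agg = new Map<string, number>();
  for (const row of rows) {
    const n = row.dims.length;
    // Every subset of kept dimensions (the rest set to null) gets its own
    // aggregate, i.e. all slices and dices are precomputed.
    for (let mask = 0; mask < (1 << n); mask++) {
      const dims = row.dims.map((d, i) => (mask & (1 << i) ? d : null));
      const key = JSON.stringify(dims);
      agg.set(key, (agg.get(key) ?? 0) + row.views);
    }
  }
  // Anonymize last, so coarser aggregates still include the counts that
  // get blanked out at finer granularity.
  return [...agg.entries()].map(([key, val]) => ({
    dims: JSON.parse(key),
    val: val < k ? 'undef' : val,
  }));
}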
2. *Timeseries dataset that stores the number of pageviews per article
in time with*:
- maximum resolution: hourly
- diminishing resolution over time is accepted if there are storage
constraints
article (dialect.project/article), day/hour, value
en.wikipedia/Main_page, 2015-01-01 17, 123456
en.wiktionary/Bazinga, 2015-01-02 13, 23456
It's difficult to calculate the size of that. How many articles do we have?
But not all of them will have pageviews every hour...
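If it helps, here is a rough sketch of what "diminishing resolution" could
look like (purely illustrative assumptions on my side, e.g. keeping 90 days
of hourly data and rolling older rows up to daily; nothing is decided):

type Row = { article: string; ts: Date; views: number };

// Keep recent rows hourly; sum rows older than `keepHourlyDays` into daily buckets.
function diminishResolution(rows: Row[], keepHourlyDays = 90): Row[] {
  const cutoff = Date.now() - keepHourlyDays * 24 * 3600 * 1000;
  const daily = new Map<string, Row>();
  const recent: Row[] = [];
  for (const r of rows) {
    if (r.ts.getTime() >= cutoff) {
      recent.push(r); // recent enough: keep at hourly resolution
      continue;
    }
    const day = r.ts.toISOString().slice(0, 10); // e.g. "2015-01-01"
    const key = `${r.article}|${day}`;
    const acc = daily.get(key);
    if (acc) acc.views += r.views;
    else daily.set(key, { article: r.article, ts: new Date(day), views: r.views });
  }
  return recent.concat([...daily.values()]);
}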
*Note*: I guess we should consider that the storage system will presumably
have high volume batch inserts every hour or so, and queries that will be a
lot more frequent but also a lot lighter in data size.
And that is that.
*So please, feel free to suggest storage technologies, comment, etc!*
And if there is any assumption I made that you do not agree with, please
say so.
I will start the thread with 2 suggestions:
1) *PostgreSQL*: Seems to be able to handle the volume of the data, and
diminishing resolution for timeseries can be implemented on top of it.
2) *Project Voldemort*: As we are denormalizing the cube entirely for
anonymity, the db doesn't need to compute aggregations, so it may well be a
simple key-value store.
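To illustrate why a key-value store could be enough (again just a sketch
with a made-up key layout, not an actual Voldemort client call): since every
slice/dice combination is precomputed, a query is a single get on the
concatenated dimension values.

// Hypothetical key layout: one key per precomputed cube cell; null dimensions
// are encoded explicitly so aggregates like "all projects" stay addressable.
function cubeKey(dims: (string | null)[]): string {
  return dims.map(d => d ?? '*').join('|'); // e.g. "2015-01-01T17|en.wikipedia|*"
}

// With any client exposing get(key), a lookup is then a single read, e.g.:
// const views = await store.get(cubeKey(['2015-01-01T17', 'en.wikipedia', null]));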
(adding Analytics, as a relevant group for this discussion.)
I think this is next to meaningless, because the differing bot policies and
practices on different wikis skew the data into incoherence.
The (already existing) metric of active-editors-per-million-speakers is, it
seems to me, a far more robust metric. Erik Z.'s stats.wikimedia.org is
offering that metric.
On Sun, Jun 7, 2015 at 3:23 PM, Milos Rancic <millosh(a)gmail.com> wrote:
> When you get data, at some point you start thinking about
> quite fringe comparisons. But that could actually give some useful
> conclusions, as it did this time.
> We did the following:
> * Used the number of primary speakers from Ethnologue. (Erik Zachte is
> using the approximate number of primary + secondary speakers; that could
> be useful for correcting this data.)
> * Categorized languages according to the logarithmic number of
> speakers: >=1k, >=10k, >=100k, >=1M, >=10M, >=100M.
> * Took the number of articles of the Wikipedia in a particular language
> and created a ratio (number of articles / number of speakers).
> * This list consists just of languages with Ethnologue status 1
> (national), 2 (provincial) or 3 (wider communication). In fact, we
> have a lot of projects (more than 100) with worse language status; a
> number of them are actually threatened or even on the edge of
> extinction.
> Those are the preliminary results and I will definitely have to go
> through all the numbers. I manually fixed some serious errors, like
> English Wikipedia itself not being inside the data :D
> Putting the languages into the logarithmic categories proved to be
> useful, as we are now able to compare the Wikipedias according to
> their gross capacity (numbers of speakers). I suppose somebody well
> versed in statistics could even create a function which could be used
> to check how well a project is doing, regardless of those strict
> categories.
> It's obvious that the more speakers a language has, the harder it is
> for the community to keep up the ratio.
> So, the winners per category are:
> 1) >= 1k: Hawaiian, ratio 0.96900
> 2) >= 10k: Mirandese, ratio 0.18073
> 3) >= 100k: Basque, ratio 0.38061
> 4) >= 1M: Swedish, ratio 0.21381
> 5) >= 10M: Dutch, ratio 0.08305
> 6) >= 100M: English, ratio 0.01447
> However, keep in mind that we removed languages not inside categories
> 1, 2 or 3. That affected the >=10k languages, as, for example, Upper
> Sorbian does much better than Mirandese (0.67). (Will fix it while
> creating the full report. Obviously, in this case logarithmic
> categories of numbers of speakers matter much more than the status
> of the language.)
> It's obvious that we could draw a line from 1:1 for 1-10k
> speakers to 10:1 for >=100M speakers. But, again, I would like to get
> input from somebody more competent.
> One very important category is missing here and it's about the level
> of development of the speakers. That could be added: GDP (PPP) per
> capita for the country or countries where the language is spoken would
> be a useful measure. And I suppose somebody with statistical knowledge
> would be able to give us a number which would mean "ability to create
> Wikipedia article".
> Completed in such a way, we'd be able to measure the success of
> particular Wikimedia groups and organizations. OK, articles per
> speaker are not the only way to do so; we could use other
> parameters as well: number of new/active/very active editors etc. And
> we could put it on a time scale.
> I'll produce some other results. And, as a reminder: I'd like to have a
> formula to compute "ability to create Wikipedia article" and then to
> produce "level of particular community success in creating Wikipedia
> articles". And, of course, to implement it for editors.
Wikimedia Foundation <http://www.wikimediafoundation.org>
Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it a reality!
I have been asking this question informally for too long, so here goes the
formal request:
Metrics about the external use of the Wikimedia APIs
We need them and, in fact, an outsider would be very surprised by the fact
that we don't have them today and are not looking at them regularly, the
way we check page views and edits.
It is a vague goal on a bumpy road, but I'm happy contributing at least
questions about the metrics we need. The Engineering Community team wants
to have this metric as a main measure of the success of our performance (the
more Wikimedia knowledge being spread and improved via our API, the better
we are doing working with developers).
Engineering Community Manager @ Wikimedia Foundation
Cross-posting to analytics. Props to Vibha for asking for the data.
---------- Forwarded message ----------
From: *Adam Baso* <abaso(a)wikimedia.org>
Date: Wednesday, June 10, 2015
Subject: Some data on apps and web
To: mobile-l <mobile-l(a)lists.wikimedia.org>
Hi all, thought I'd share some data from a few queries around apps uniques
and apps + web pageviews, etc. from recent history:
We're at the beginning of our FY2015-2016 Q1 planning, and are also
readying our thinking on Reading strategy for the longer haul, and I was
hoping this might be of some use.
Within the discovery team we are now looking into tracking dwell time and
bounce rate for the pages linked from the SERPs. To accurately track dwell
time we need to fire an event in the unload handler of article pages.
Poking around in the EventLogging code I see we are now using sendBeacon if
it is available, and that's great. The problem is that all of the browsers
that do not have sendBeacon (many) will not send this event. They will
inject an img tag that will not be processed as the page is being unloaded.
Searching around I saw some discussion about this almost a year ago, in May
2014, before sendBeacon support was added (in Nov 2014),
titled "[Analytics] Using EventLogging for funnel analysis". There it was
proposed to push the events into localStorage to be sent during a future
page view. I don't see any other viable options, so I'm wondering if there
is any reason I shouldn't look into implementing this now (with the jStorage
wrapper of localStorage available in core)?
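Roughly what I have in mind, as a sketch (plain localStorage here instead of
the jStorage wrapper, and the endpoint/key names are made up, not
EventLogging's actual API):

const QUEUE_KEY = 'el-deferred-events'; // hypothetical storage key

function logEvent(beaconUrl: string, payload: object): void {
  const data = JSON.stringify(payload);
  if (typeof navigator.sendBeacon === 'function') {
    // sendBeacon queues the request so it survives page unload.
    navigator.sendBeacon(beaconUrl, data);
    return;
  }
  // No sendBeacon: stash the event and send it on the next page view,
  // since an img request fired during unload is likely to be dropped.
  const queue: string[] = JSON.parse(localStorage.getItem(QUEUE_KEY) ?? '[]');
  queue.push(data);
  localStorage.setItem(QUEUE_KEY, JSON.stringify(queue));
}

// On the next page load, flush anything that was deferred.
function flushDeferredEvents(beaconUrl: string): void {
  const queue: string[] = JSON.parse(localStorage.getItem(QUEUE_KEY) ?? '[]');
  localStorage.removeItem(QUEUE_KEY);
  for (const data of queue) {
    new Image().src = beaconUrl + '?' + encodeURIComponent(data); // page is alive now
  }
}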