We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
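As a quick illustration of the second use case, here's a minimal Python sketch; it assumes a tab-separated dump with prev, curr and n columns and a made-up file name, so check the figshare page for the exact schema of this release:

    import csv
    from collections import defaultdict

    # Minimal sketch: most common links people followed to an article.
    # Assumes tab-separated (prev, curr, n) columns and a made-up file
    # name; the real schema is documented on the figshare page.
    clicks_to = defaultdict(list)  # article -> [(referer, count), ...]

    with open("2015_01_clickstream.tsv", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row, if the file has one
        for prev, curr, n in reader:
            clicks_to[curr].append((prev, int(n)))

    # Ten most common ways readers reached "London" in January 2015
    for referer, count in sorted(clicks_to["London"],
                                 key=lambda pc: pc[1], reverse=True)[:10]:
        print(referer, count)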
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi all,
I hacked up a very quick count of the 2015 video viewing aggregate
figures, using the data that Bartosz put together last year - with the
caveat that the data only goes up to 10 December, but it's probably
indicative of whole-year trends. I haven't yet tried to merge in the
11-31 December data. Nothing very insightful, but I don't recall seeing it
done before, so it might be of interest!
http://www.generalist.org.uk/blog/2016/most-popular-videos-on-wikipedia/
The headline figure is that we had about three billion (!!)
video/audio plays during the year, and that some of the most popular
items are insanely popular - the most popular was viewed an average of
42,000 times a day, every day.
Pine: the video you asked about in the other thread was viewed 187,899
times from 31 October to 10 December 2015. So there's half your answer :-)
--
- Andrew Gray
andrew.gray(a)dunelm.org.uk
Roan:
The data for the Echo schema (https://meta.wikimedia.org/wiki/Schema:Echo) is
quite large and we are not sure it is even used.
Can you confirm either way? If it is no longer used, we will stop collecting
it.
Thanks,
Nuria
I should have started this discussion a while ago, but it's easier to catch
up on work during vacation :)
We currently have 3 available static file dumps of pageview data. I will
describe them here and share my thoughts on simplifying the situation.
Feel free to turn this thread into a wiki page.
* PAGECOUNTS-RAW <http://dumps.wikimedia.org/other/pagecounts-raw/>. We
have this data going back to 2007. It uses a very simple pageview
definition that incorrectly counts things like banner views as pageviews.
* PAGECOUNTS-ALL-SITES
<http://dumps.wikimedia.org/other/pagecounts-all-sites/>. We have this
data starting in late 2014. Compared to PAGECOUNTS-RAW, this dataset also
adds traffic from the mobile versions of our sites. But it's still using
the same simple pageview definition.
* PAGEVIEWS <http://dumps.wikimedia.org/other/pageviews/>. We have this
data starting in May 2015. It implements the new and much improved
pageview definition <https://meta.wikimedia.org/wiki/Research:Page_view>
that we now use. This is the same pageview definition used in the pageview
API. This dataset also removes spider traffic and any automata traffic
that we can detect.
All three datasets are in the same format (Domas's archive format).
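For anyone new to that format: each line of an hourly file has four space-separated fields (project, page title, request count, bytes transferred). A minimal Python sketch, with a made-up file name:

    import gzip
    from collections import Counter

    # Each line looks like: en Main_Page 242332 4737756101
    # (project, page title, hourly request count, bytes transferred).
    counts = Counter()
    with gzip.open("pagecounts-20150101-000000.gz", "rt",
                   encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:
                continue  # skip malformed lines
            project, title, views, _size = parts
            if project == "en":
                counts[title] += int(views)

    print(counts.most_common(10))  # top English Wikipedia titles that hour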
So, before we can simplify this confusing situation, we need your help and
input about what to keep and how to keep it. Here's the approach I would
take:
Combine pagecounts-raw and pagecounts-all-sites into a new dataset called
"pagecounts". Keep producing data into this dataset forever, but retire
"pagecounts-raw" and "pagecounts-all-sites". This way, we can compare new
data with historical data going back as far as we need. We would note on
dumps.wikimedia.org/other that this dataset gains mobile data starting
in October 2014, to explain the relative local spike that happens there.
This dataset would remain a pretty bad estimate of actual page views, and
would remain sensitive to automata and spider spikes. But in combination
with the "pageviews" dataset, I think it would be useful.
What do you all think? Sound off in this thread, and if we have consensus
I'll start the cleanup.
Hi analytics list,
In the past months the WikimediaBot convention has been mentioned in a
couple of threads, but we (the Analytics team) never finished establishing
and advertising it. In this email we explain what the convention is today
and what purpose it serves, and we ask for feedback to be sure we can
continue with the next steps.
What is the WikimediaBot convention?
It is a way of better identifying Wikimedia traffic generated by bots.
Today we know that a significant share of Wikimedia traffic comes from
bots. We can recognize part of that traffic with regular expressions[1],
but we cannot recognize all of it, because some bots do not identify
themselves as such.
we could also better isolate the human traffic and permit more accurate
analyses.
Who should follow the convention?
Computer programs that access Wikimedia sites or the Wikimedia API for
reading purposes* in a periodic, scheduled or automatically triggered way.
Who should NOT follow the convention?
Computer programs that follow the ad-hoc, on-site commands of a human,
like browsers, and well-known spiders that are already recognizable by
their user-agent strings.
How to follow the convention?
The client's user-agent string should contain the word "WikimediaBot". The
word can be anywhere within the user-agent string and is case-sensitive.
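For example, here's a minimal Python sketch of a client following the convention; the script name, URL and contact address are made up, only the "WikimediaBot" keyword matters:

    import requests

    # Hypothetical client: the user-agent just needs to contain the
    # case-sensitive word "WikimediaBot" somewhere.
    session = requests.Session()
    session.headers["User-Agent"] = (
        "ExampleFetcher/0.1 WikimediaBot "
        "(https://example.org/bot; ops@example.org)"
    )

    resp = session.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "titles": "Main Page", "format": "json"},
    )
    resp.raise_for_status()

    # On the analytics side, detection is a simple case-sensitive check:
    assert "WikimediaBot" in session.headers["User-Agent"]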
So, please, feel free to post your comments/feedback on this thread. In the
course of this discussion we can adjust the convention's definition and, if
no major concerns are raised, in 2 weeks we'll create a documentation page
on Wikitech, send an email to the proper mailing lists, and maybe write a
blog post about it.
Thanks a lot!
(*) There is already another convention[2] for bots that EDIT Wikimedia
content.
[1]
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…
[2] https://www.mediawiki.org/wiki/Manual:Bots
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
Hi Analytics folks,
My understanding is that the new pageview definition, which excludes
automata to a certain extent, is now published. I have a few questions:
1. Has stats.grok.se already transitioned to the new definition, or will it be?
2. Is there a replacement for stats.grok.se planned or already available? A
reliable substitute would be great, and it would be nice if we could either
replace the existing on-wiki "page view statistics" link or add a
supplemental link to the new resource.
Apologies if this information was already published and I missed it.
Thanks,
Pine
Hi Analytics fellows,
We are experiencing issues with loading data into the Hadoop cluster,
which is blocking the full job pipeline.
When fixed, the cluster will be heavily loaded while it catches up, so
please be nice to it and don't run heavy jobs in the next few hours.
We'll keep you posted about resolution.
Many thanks, and sorry for the inconvenience.
Joseph
Hi all,
In order to convert tables on db1046 to the TokuDB engine, we have to
schedule some downtime on the EventLogging databases from tomorrow,
Thursday, Jan 21, 2016 at 16:00 UTC to Monday, Jan 25, 2016 at 16:00 UTC.
What this means for EL users:
1. EventLogging will still receive data and it will be available in Kafka.
The data will continue to be imported into Hadoop and into files.
2. The MySQL consumers of EventLogging will be stopped, so no data will get
imported into the master (db1046, a.k.a. m4-master) and, by extension, into
the analytics-store.
3. Querying existing data from analytics-store will still work, but
data for the next 4 days won't be available.
4. On Monday, after the maintenance window, we'll restart the MySQL
consumers, and all the data should get reimported from Kafka.
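For the curious, reading the buffered events straight from Kafka during the window might look roughly like this with the kafka-python library; the broker and topic names below are placeholders, not our production config:

    from kafka import KafkaConsumer
    import json

    # Rough sketch only: broker and topic are placeholder names, and the
    # real EventLogging setup may consume and validate events differently.
    consumer = KafkaConsumer(
        "eventlogging-valid-mixed",
        bootstrap_servers=["kafka1001.example:9092"],
        auto_offset_reset="earliest",  # replay events buffered so far
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        event = message.value
        print(event.get("schema"), event.get("timestamp"))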
Analytics and Ops (DBA) will work on this together.
Feel free to reach out to us here or on #wikimedia-analytics if you have
any concerns/questions.
-- Madhu Viswanathan
Software Engineer, Analytics