We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
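The uses above can be sketched in a few lines. This is a minimal illustration only: the row values below are made up, and the real dataset is a TSV whose exact column names (e.g. prev/curr/n) should be checked against the release itself.

```python
from collections import defaultdict

# (referer, article, count) rows; the numbers here are fabricated for
# illustration, not taken from the actual dataset.
rows = [
    ("other-google", "London", 1000),
    ("Hannibal", "London", 300),
    ("London", "England", 700),
    ("London", "Hannibal", 200),
]

out_counts = defaultdict(dict)  # referer -> {article: clicks}
in_counts = defaultdict(int)    # article -> total clicks arriving at it
for prev, curr, n in rows:
    out_counts[prev][curr] = out_counts[prev].get(curr, 0) + n
    in_counts[curr] += n

def top_links_from(article):
    """Most frequently clicked links on a given article's page."""
    return sorted(out_counts.get(article, {}).items(), key=lambda kv: -kv[1])

def transition_probs(article):
    """One row of a Markov chain over articles: P(next | current)."""
    links = out_counts.get(article, {})
    total = sum(links.values())
    return {a: n / total for a, n in links.items()}

print(top_links_from("London"))  # most-clicked outgoing links
print(in_counts["London"])       # total recorded traffic into the article
```

The same aggregation, run over the full 22 million pairs, gives the per-article link rankings and the transition matrix of the Markov chain mentioned above.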
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi everyone,
I'm a PhD student studying mathematical models to improve the hit ratio
of web caches. In my research community, we lack realistic data sets and
frequently rely on outdated modelling assumptions.
Previously (~2007), a trace containing 10% of user requests issued to
Wikipedia was publicly released [1]. This data set has been used
widely for performance evaluations of new caching algorithms, e.g., for
the new Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in
the Analytics/Data/Mobile requests stream [3] on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive data
(e.g., client IPs), it would need anonymization before making it public.
I would be glad to help with that.
The previously released data set [1] contains no client information. It
contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update
flag. I would additionally suggest including 5) the cache's hostname,
6) the cache_status, and 7) the response size (from the Wikimedia cache
log format).
I believe this format would preserve anonymity, and would be interesting
for many researchers.
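To make the proposal concrete, here is a rough sketch of producing one such anonymized record. The raw log line, its field order, and the field names are hypothetical stand-ins, not the actual Wikimedia cache log format; the point is only that the client IP is dropped and fields 1-7 are kept.

```python
# Hypothetical raw cache log line (field layout invented for illustration):
# host seq timestamp client_ip cache_status size url update_flag
raw_line = ("cp1054 1234 2016-02-20T12:00:00 203.0.113.7 "
            "hit/200 5120 http://en.wikipedia.org/wiki/Main_Page -")

def anonymize(line, counter):
    """Map a raw log line to the proposed 7-field anonymized record."""
    host, _seq, ts, _client_ip, cache_status, size, url, update_flag = line.split()
    # The client IP is discarded entirely; only fields 1-7 survive:
    # counter, timestamp, URL, update flag, cache hostname, cache_status, size.
    return (counter, ts, url, update_flag, host, cache_status, int(size))

record = anonymize(raw_line, 1)
print(record)
```

Keeping the cache hostname, cache_status, and response size would let caching researchers reconstruct hit ratios and byte-weighted statistics without any client-identifying information.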
Let me know your thoughts.
Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3]
https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
Hi all,
I hacked up a very quick count of the 2015 video viewing aggregate
figures, using the data that Bartosz put together last year - with the
caveat that the data only goes up to 10 December, but it's probably
indicative of whole-year trends. I haven't yet tried to merge in the
11-31/12 data. Nothing very insightful but I don't recall seeing it
done before, so it might be of interest!
http://www.generalist.org.uk/blog/2016/most-popular-videos-on-wikipedia/
The headline figure is that we had about three billion (!!)
video/audio plays during the year, and that some of the most popular
items are insanely popular - the most popular was viewed an average of
42,000 times a day, every day.
Pine: the video you asked about in the other thread was viewed 187,899
times from 31/10/15 to 10/12/15. So there's half your answer :-)
--
- Andrew Gray
andrew.gray(a)dunelm.org.uk
Roan:
The data for the Echo schema (https://meta.wikimedia.org/wiki/Schema:Echo) is
quite large and we are not sure it is even used.
Can you confirm either way? If it is no longer used we will stop collecting
it.
Thanks,
Nuria
I notice you mention in a lot of places that people should contact an administrator. What if the person they are complaining about is an admin? I have seen admins violate several of these rules of conduct, and it's rare for anything to be done about it.
Reguyla
Sent from my T-Mobile 4G LTE device
------ Original message ------
From: Matthew Flaschen
Date: Tue, Feb 23, 2016 8:45 PM
To: Wikitech List; Engineering List; Design List; Wiki Research List; Analytics Public List; hackathonorganizers@lists.wikimedia.org
Subject: [Analytics] Please provide feedback on suggested improvements to the Code of Conduct
Thanks to everyone who’s helped work on the Code of Conduct so far.

People have brought up issues they feel were missed when working on "Unacceptable behavior" ( https://www.mediawiki.org/wiki/Code_of_Conduct/Draft#Unacceptable_behavior ) and "Report a problem" ( https://www.mediawiki.org/wiki/Code_of_Conduct/Draft#Report_a_problem ). Consultants have also suggested changes in these same sections.

These are important sections, so please take a look at the proposed changes ( https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#Suggested_changes ). I apologize that this feedback arrived later than planned, but I think this will create a better document.

If you prefer to give your opinion privately, feedback via e-mail is welcome at conduct-discussion(a)wikimedia.org.

Thanks,
Matt Flaschen
Hello there,
I'm a beginner programmer working on a project using Wikipedia page views, trying to reproduce a paper. I was pulling JSON from http://stats.grok.se, but the last month isn't updated. I'm trying to get the last 30 days of Wikipedia views for several pages, but the hourly files are 70 MB compressed. Is it possible to query the specific data I need directly through some kind of Wikipedia database?
Otherwise, regarding the files (located at http://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-01/), how should they be opened/accessed? I'm using a high-level language, so working with files this large sounds like it will break my computer.
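For what it's worth, those hourly files can be read line by line without ever holding a whole file in memory. A minimal sketch, assuming the pagecounts-raw line layout of "project page_title count bytes" (worth verifying against the dumps documentation) and using a tiny fabricated sample instead of a real download:

```python
import gzip
import io

# Fabricated stand-in for one gzipped hourly pagecounts file.
sample = b"""en Main_Page 4200 123456
en London 310 99999
de London 12 3456
"""
compressed = gzip.compress(sample)

wanted = {("en", "London"), ("en", "Main_Page")}
counts = {}
# gzip.open accepts a file path the same way; lines are decompressed and
# decoded one at a time, so even a 70 MB .gz file needs only a little memory.
with gzip.open(io.BytesIO(compressed), "rt") as f:
    for line in f:
        project, title, count, _bytes = line.split()
        if (project, title) in wanted:
            counts[(project, title)] = counts.get((project, title), 0) + int(count)

print(counts)
```

Summing the per-hour counts for the pages of interest across the month's files would give the 30-day totals without any bulk database access.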
Like I said, I'm fairly new to this all so I apologize in advance if these questions seem silly.
Thank you for any help,
Dominic Della Sera
Hi all,
Ops is working on upgrading our web caching software from Varnish 3 to
Varnish 4. At the moment, this would mean we would lose webrequest logs,
since our webrequest logging software (varnishkafka) is incompatible with
Varnish 4. There is an effort to fix this incompatibility, but Ops would
like to be able to test Varnish 4 out in production in a limited fashion
without having to worry about this blocker.
Question: Do any of you use the webrequest_source=‘misc’ partition for
production analysis? If this Hive partition did not have data for a
limited period of time, would this break anyone’s data?
If not, then Ops would like to use the misc cache cluster for testing this
change.
Thanks!
-Andrew & Luca
Hi all – heads up that we extended the submission deadline for the Wiki
Workshop at ICWSM '16 to *Wednesday, March 3, 2016*. (The second deadline
remains unchanged: March 11, 2016).
You can check the workshop's website
<http://snap.stanford.edu/wikiworkshop2016/> for submission instructions or
follow us at @wikiworkshop16 <https://twitter.com/wikiworkshop16> for live
updates.
Looking forward to your contributions.
Dario
Hiya,
We’re ready to upgrade the Analytics Cluster to CDH 5.5. To do so, we need
to schedule a maintenance period during which we can stop all Hadoop
related services. This includes Hive, Oozie, Spark, etc.
I’d like to plan this for Tuesday February 23rd starting at 14:00 UTC
(09:00 US east coast, 06:00 US west coast). We’ve practiced this upgrade a
few times in labs now, and I don’t foresee any issues. I predict that it
will take us no more than 2 hours to finish, but just in case I’d like to
reserve 8 hours for this.
Please plan on not using the Analytics Cluster between 14:00 and 22:00 on
February 23rd. I will update this thread again when we are about to start,
and when we are finished.
Progress is being tracked here: https://phabricator.wikimedia.org/T119646
What we get:
-
http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_rn_new_i…
-
http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_rn_fixed…
Thanks all!
-Andrew + Analytics team