Analytics March 2016

analytics@lists.wikimedia.org

33 participants
23 discussions

Wikipedia aggregate clickstream data released

by Dario Taraborelli

We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:

http://dx.doi.org/10.6084/m9.figshare.1305770

This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.

This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia

We created a page on Meta for feedback and discussion about this release:

https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream

Ellery and Dario
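As a rough illustration of two of the uses listed in the announcement (most-clicked links and a Markov chain), here is a minimal Python sketch. It assumes the data has already been loaded as a list of (referer, article, count) tuples with one row per pair; the function names and the sample rows are made up for illustration and are not taken from the dataset or its documentation.

```python
from collections import defaultdict


def transition_probabilities(pairs):
    """Estimate P(article | referer) from aggregated (referer, article, count) rows.

    Assumes `pairs` is a list (it is iterated twice) and that each
    (referer, article) pair appears at most once.
    """
    totals = defaultdict(int)
    for referer, _article, count in pairs:
        totals[referer] += count

    chain = defaultdict(dict)
    for referer, article, count in pairs:
        chain[referer][article] = count / totals[referer]
    return dict(chain)


def top_links(pairs, referer, n=3):
    """Return the n most-followed (article, count) rows for one referer."""
    rows = [(article, count) for r, article, count in pairs if r == referer]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


# Made-up sample rows, only to show the shape of the input.
SAMPLE = [
    ("London", "England", 60),
    ("London", "River_Thames", 30),
    ("London", "Big_Ben", 10),
    ("England", "London", 50),
]

if __name__ == "__main__":
    print(top_links(SAMPLE, "London", n=2))
    print(transition_probabilities(SAMPLE)["London"])
```

Normalising counts per referer gives one row of the transition matrix at a time, which keeps memory proportional to the number of observed pairs rather than to the square of the number of articles.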

6 years, 3 months

Request stream data set for cache tuning

by Daniel Berger

Hi everyone,

I'm a PhD student studying mathematical models to improve the hit ratio of web caches. In my research community, we lack realistic data sets and frequently rely on outdated modelling assumptions.

Previously (~2007), a trace containing 10% of user requests issued to Wikipedia was publicly released [1]. This data set has been used widely for performance evaluations of new caching algorithms, e.g., for the new Caffeine caching framework for Java [2].

I would like to ask for your comments about compiling a similar (updated) data set and making it public. In my understanding, the necessary logs are readily available, e.g., in the Analytics/Data/Mobile requests stream [3] on stat1002, with a sampling rate of 1:100.

As this request stream contains sensitive data (e.g., client IPs), it would need anonymization before making it public. I would be glad to help with that.

The previously released data set [1] contains no client information. It contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update flag. I would additionally suggest including 5) the cache's hostname, 6) the cache_status, and 7) the response size (from the Wikimedia cache log format). I believe this format would preserve anonymity and would be interesting for many researchers.

Let me know your thoughts.

Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger

[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
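To make concrete how such a trace is typically used, the following Python sketch replays a sequence of request URLs through a simple LRU cache and reports the hit ratio. It is only an illustration under stated assumptions: the cache holds a fixed number of objects (response sizes are ignored), the function name is hypothetical, and the sample trace is invented rather than taken from the Wikipedia logs.

```python
from collections import OrderedDict


def lru_hit_ratio(urls, capacity):
    """Replay a sequence of URLs through an LRU cache of `capacity` objects.

    Returns the fraction of requests served from the cache.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    total = 0
    for url in urls:
        total += 1
        if url in cache:
            hits += 1
            cache.move_to_end(url)  # mark as most recently used
        else:
            cache[url] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


# Invented trace, only to show the input shape (one URL per request).
TRACE = ["/a", "/b", "/a", "/c", "/b", "/a"]

if __name__ == "__main__":
    for size in (2, 3):
        print(size, lru_hit_ratio(TRACE, size))
```

A size-aware variant would track the sum of cached response sizes instead of the object count, which is where the proposed response-size field would come in.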

7 years, 7 months

Video view stats

by Andrew Gray

Hi all, I hacked up a very quick count of the 2015 video viewing aggregate figures, using the data that Bartosz put together last year - with the caveat that the data only goes up to 10 December, but it's probably indicative of whole-year trends. I haven't yet tried to merge in the 11-31/12 data. Nothing very insightful but I don't recall seeing it done before, so it might be of interest! http://www.generalist.org.uk/blog/2016/most-popular-videos-on-wikipedia/ The headline figure is that we had about three billion (!!) video/audio plays during the year, and that some of the most popular items are insanely popular - the most popular was viewed an average of 42,000 times a day, every day. Pine: the video you asked about in the other thread was viewed 187,899 times from 31/10/15 to 10/12/15. So there's half your answer :-) -- - Andrew Gray andrew.gray(a)dunelm.org.uk

7 years, 11 months

Sections of Code of Conduct resolved and Code of Conduct approval process

by Matthew Flaschen

We’ve gotten good participation as we’ve worked on sections of the Code of Conduct over the past few months, and have made considerable improvements to the draft based on your feedback.

Given that, and the community approval through the discussions on each section, the best approach is to proceed by approving section-by-section until the last section is done. So, please continue to improve the Code of Conduct by participating now and as future sections are discussed. When the last section is completed and approved on the talk page, the Code of Conduct will become policy and no longer be marked as a draft.

Also, two more discussions regarding the Code of Conduct have been resolved and incorporated into the draft.

* "Enforcement issues" addressed the reporting process and clarified that Committee decisions could not be circumvented
* "Marginalized and underrepresented groups" forbids discrimination

Thanks,
Matt Flaschen

8 years

db1047 corruption issues on Edit schema - use analytics-store

by Jaime Crespo

There seem to be issues with the Edit_13457736 table on the db log, on the db1047 host. db1047 may be better known to some of you as "s1-analytics-slave" or "analytics-slave". You can use dbstore1002 for the time being ("analytics-store" or "sX-analytics-slave", where X is 2-7) to read from that table. Other "schemas" are not affected. I am fixing those right now, but it will take some time, as I may have to handle (again) the recently purged rows on that table. Worst case scenario, I may have to do a quick reboot of that machine. I will update the status soon to communicate next steps. -- Jaime Crespo <http://wikimedia.org>

8 years

Fwd: Please provide feedback on new discrimination and enforcement sections of Code of Conduct

by Matthew Flaschen

I usually send these to multiple lists, but I realized I forgot to send this to the ones besides wikitech-l. The "Marginalized and underrepresented groups" discussion (https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#New_proposed_word…) is still open. I'll probably give it two weeks total, which means closing it late tomorrow. Matt Flaschen -------- Forwarded Message -------- Subject: Please provide feedback on new discrimination and enforcement sections of Code of Conduct Date: Wed, 16 Mar 2016 20:23:24 -0400 From: Matthew Flaschen <mflaschen(a)wikimedia.org> To: Wikitech List <wikitech-l(a)lists.wikimedia.org> Thanks for your participation in the recent Code of Conduct discussions. The "Marginalized and underrepresented groups" discussion had a lot of feedback. There was not consensus to use the exact original wording, but many people expressed willingness to support a modified text. I've proposed such a new text, based on Neil P. Quinn's text, with a small modification to account for discrimination required by law (e.g. age of people who can sign certain contracts). Please participate at https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#New_proposed_word… . The "Enforcement issues" section received general support, but some of that was conditional, or expressed preference for wording that developed during the discussion. The original wording also did not address the appeals body, which was raised in the discussion. Please participate at https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#Circumvention_tex… Update regarding completed discussions: The "Clarification of legitimate reasons for publication of private communications and identity protection" and "Definitions - trolling, bad-faith reports" discussions have been closed. They both had support, and I've incorporated the text into the draft. Thanks, Matt

8 years

Hadoop cluster - Added automatic failover to the HDFS Namenode

by Luca Toscano

Hi! TL;DR: The Analytics team added automatic failover for the HDFS Namenode. A new daemon called hadoop-hdfs-zkfc (port 8019) is running on the analytics1001/1002 hosts; it is responsible for talking to Zookeeper and running periodic health checks. Monitoring and Ferm rules have been added. More info: 1) https://phabricator.wikimedia.org/T129838 2) https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HDFSH… 3) Monitoring and Ferm rules added: https://gerrit.wikimedia.org/r/#/c/279408/ Let me know if you have any questions! Luca

8 years, 1 month

Edit-Analysis Dashboard back on track

by Marcel Ruiz Forns

Hi editing, Just to let you know that after the modifications to the Edit table in EL database, the reports have been able to catch up and back-fill until today. So https://edit-analysis.wmflabs.org/compare/ is working again. Cheers! -- *Marcel Ruiz Forns* Analytics Developer Wikimedia Foundation

8 years, 1 month

Re: [Analytics] [wmf.webrequest data] one-time access

by Michal Bystricky

We would like to have the URIs of requests for some period of usage, let's say 1 month. According to the data format <https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest>, the Webrequest attributes we need are the following: http_method, uri_host, uri_path, uri_query, ts, access_method, agent_type, pageview_info, page_id

Do we need to go through the NDA process, or is it possible to get the data right away from the public dataset?

Thank you,
M.

> Can you be more specific about what you need, Michal? If you truly
> need access to the private data that we keep in wmf.webrequest for a
> limited time, then you'd have to go through a process to sign an NDA.
> But if you tell us what you need, there may be a public dataset that
> you can use.
>
> On Thu, Mar 3, 2016 at 2:48 PM, Michal Bystricky
> <michal.bystricky(a)stuba.sk> wrote:
>
> Hello Analytics Team,
>
> We would like to have one-time access to wmf.webrequest data. What
> is the correct way of accessing the data?
>
> In our research group, we want to simulate the requests for
> specific version of WikiMedia.
>
> Thanks,
> Michal Bystricky
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

8 years, 1 month

[Data Release] [Data Deprecation] [Analytics Dumps]

by Dan Andreescu

We're happy to announce a few improvements to Analytics data releases on dumps.wikimedia.org:

* We are releasing a new dataset, an estimate of Unique Devices accessing our projects [1]
* We are officially making available a better Pageviews dataset [2]
* We are deprecating two older pageview statistics datasets
* We moved Analytics data from /other to /analytics [3]

Details follow:

*Unique Devices:*

Since 2009, the Wikimedia Foundation used comScore to report data about unique web visitors. In January 2016, however, we decided to stop reporting comScore numbers [4] because of certain limitations in the methodology; these limitations translated into misreported mobile usage. We are now ready to replace comScore numbers with the Unique Devices Dataset [5][1]. While unique devices does not equal unique visitors, it is a good proxy for that metric, meaning that a major increase in the number of unique devices is likely to come from an increase in distinct users. We understand that counting uniques raises fairly big privacy concerns, so we count unique devices in a privacy-conscious way: it does not include any cookie by which your browser history can be tracked [6].

We invite you to explore this new dataset and hope it’s helpful for the Wikimedia community in better understanding our projects. This data can help measure the reach of Wikimedia projects on the web.

*Pageviews:*

This [2] is the best quality data available for counting the number of pageviews our projects receive at the article and project level. We've upgraded from pagecounts-raw to pagecounts-all-sites, and now to pageviews, in order to filter out more spider traffic and measure something closer to what we think is a real user viewing content. A short history might be useful:

* pagecounts-raw: was maintained by Domas Mituzas originally and taken over by the analytics team. It was and still is the most used dataset, though it has some major problems. It does not count access to the mobile site, it does not filter out spider or bot traffic, and it suffers from unknown loss due to logging infrastructure limitations.
* pagecounts-all-sites: uses the same pageview definition as pagecounts-raw, and so also does not filter out spider or bot traffic. But it does include access to mobile and zero sites, and is built on a more reliable logging infrastructure.
* pagecounts-ez: is derived from the best data available at the time. So until December 2015, it was based on pagecounts-raw and pagecounts-all-sites, and now it's based on pageviews. This dataset is great because it compresses very large files without losing any information, still providing hourly page and project level statistics.

So the new dataset, pageviews, is what's behind our pageview API and is now available in static files for bulk download back to May 2015. But having multiple ways to download pageview data is confusing for consumers, so we're keeping only pageviews and pagecounts-ez and deprecating the other two. If you'd like to read more about the current pageview definition, details are on the research page [7].

*Deprecating:*

We are deprecating the pagecounts-raw and pagecounts-all-sites datasets in May 2016 (discussion here: https://phabricator.wikimedia.org/T130656 ). This data suffers from many artifacts, lack of mobile data, and/or infrastructure problems, and so is not comparable to the new way we track pageviews. It will remain here because we have historical data that may be useful, but it will not be maintained or updated beyond May 2016.

*Clean-up:*

Analytics data on dumps was crammed into /other with unrelated datasets. We made a new page to receive current and future datasets [3] and linked to it from /other and /.

Please let us know if anything there looks confusing or opaque and I'll be happy to clarify.

[1] http://dumps.wikimedia.org/other/unique_devices
[2] http://dumps.wikimedia.org/other/pageviews
[3] http://dumps.wikimedia.org/analytics/
[4] https://meta.wikimedia.org/wiki/ComScore/Announcement
[5] https://meta.wikimedia.org/wiki/Research:Unique_Devices
[6] https://meta.wikimedia.org/wiki/Research:Unique_Devices#How_do_we_count_uni…
[7] https://meta.wikimedia.org/wiki/Research:Page_view
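For readers who want to work with the bulk pageviews files, here is a small Python sketch that sums views per project. It assumes each line of an (already decompressed) hourly file holds four space-separated fields: a domain code, a page title, a view count, and a response-size field. That layout is an assumption about the file format, not something stated in the announcement, so check it against the documentation on the dumps page before relying on it. The function name and sample lines are invented.

```python
from collections import Counter


def views_per_project(lines):
    """Sum view counts per domain code from pageviews-style lines.

    Assumed line layout: "<domain_code> <page_title> <view_count> <size>".
    Lines that do not match are skipped rather than raising.
    """
    totals = Counter()
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != 4:
            continue  # malformed or unexpected line
        domain_code, _title, count, _size = parts
        try:
            totals[domain_code] += int(count)
        except ValueError:
            continue  # non-numeric count
    return totals


# Invented sample lines, only to show the assumed shape.
SAMPLE_LINES = [
    "en Main_Page 42 0\n",
    "en Cat 8 0\n",
    "de Katze 5 0\n",
    "broken line\n",
]

if __name__ == "__main__":
    print(views_per_project(SAMPLE_LINES).most_common())
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, so a whole hourly file never needs to be held in memory.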

8 years, 1 month
