Hi all,
For all Hive users on stat1002/1004: you might have seen a deprecation
warning when you launch the hive client, saying that it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering "stat1004, whaaat?" - there should be an announcement
about it coming up soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes (a short example follows the list):
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining what fraction of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
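As a minimal illustration of the first use case, here is a Python sketch. It assumes a tab-separated file with prev_title, curr_title, and n columns; the exact file name and column headers are documented with the dataset, so treat the ones below as placeholders.

import csv
from collections import Counter

def top_referers(path, article, k=10):
    # Count clicks per referer title for one target article.
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["curr_title"] == article:
                counts[row["prev_title"]] += int(row["n"])
    return counts.most_common(k)

print(top_referers("2015_01_clickstream.tsv", "London"))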
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi everyone,
I'm a PhD student studying mathematical models to improve the hit ratio
of web caches. In my research community, we lack realistic data
sets and frequently rely on outdated modelling assumptions.
Previously (~2007), a trace containing 10% of user requests issued to
Wikipedia was publicly released [1]. This data set has been used
widely for performance evaluations of new caching algorithms, e.g., for
the new Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in
the Analytics/Data/Mobile requests stream [3] on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive data
(e.g., client IPs), it would need anonymization before being made public.
I would be glad to help with that.
The previously released data set [1] contains no client information. It
contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update
flag. I would additionally suggest including 5) the cache's hostname,
6) the cache_status, and 7) the response size (from the Wikimedia cache
log format).
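To make this concrete, here is a small Python sketch of how each anonymized record could be derived from a raw log entry; the raw field names below are only placeholders, since I do not know the exact schema of the Wikimedia cache logs.

from itertools import count

_counter = count(1)

def anonymize(raw):
    # Map one raw log entry (a dict) to the proposed 7-field record,
    # dropping the client IP and any other identifying fields.
    return (
        next(_counter),          # 1) counter
        raw["timestamp"],        # 2) timestamp
        raw["url"],              # 3) URL
        raw["is_update"],        # 4) update flag
        raw["cache_host"],       # 5) cache hostname
        raw["cache_status"],     # 6) cache status (hit/miss/pass)
        raw["response_size"],    # 7) response size in bytes
    )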
I believe this format would preserve anonymity and would be interesting
to many researchers.
Let me know your thoughts.
Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3]
https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
Just a reminder: we will be deprecating the pagecounts datasets at the end
of May, as we mentioned earlier this year [0]. This means the existing files
will remain available for researchers, but new files will not be generated
in the future.
*Pagecounts datasets that will be deprecated*
pagecounts-raw
pagecounts-all-sites
Options for switching to the new datasets [1]:
pageviews - same format, better quality data
pagecounts-ez - compressed data
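For anyone scripting against the new files, here is a minimal Python sketch of reading a single hourly pageviews file; the file name and column layout below are assumptions on our part, so please check the format notes under [1].

import gzip

def views_for(path, project, title):
    # Sum the view counts for one page in one hourly dump file.
    # Assumes space-separated lines: project page_title view_count byte_count
    total = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) >= 3 and parts[0] == project and parts[1] == title:
                total += int(parts[2])
    return total

print(views_for("pageviews-20160501-000000.gz", "en.wikipedia", "Main_Page"))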
[0] https://lists.wikimedia.org/pipermail/analytics/2016-March/005060.html
[1] https://dumps.wikimedia.org/other/analytics/
Hello Wikimedia analytics mailing list,
As part of research into how people read Wikipedia, a friend and I created
a short survey. We are interested in seeing how people on this mailing list
(not a representative sample of Wikipedia readers, for sure!) fill out the
survey. The survey should take 2 to 10 minutes to complete.
https://www.surveymonkey.com/r/QBCCVFY
I would also appreciate it if any of you are able to circulate the
survey to a different audience. If you are interested in doing that, please
let me know (off-list, if you prefer) and I will give you a separate URL
through which to do so for each such audience. The URLs correspond to the
different audiences with whom the survey is shared, so that it is easier to
understand how responses differ by audience.
Any feedback on the survey questions would also be appreciated, on- or
off-thread.
Thank you very much!
Vipul
I can't seem to get the page views report from vital signs to render:
https://vital-signs.wmflabs.org/#projects=enwiki/metrics=Pageviews
Other reports are working fine. Nothing urgent, just an FYI.
-Toby
Hi all,
A few minutes ago dbstore1002 (I think you know it better as
analytics-store) was forced into unscheduled maintenance, a.k.a.
"it crashed and I am trying to give it first aid".
Please use db1047 (analytics-slave?) for now, if you can.
I will follow up with a status update once I know more.
Sorry for the inconvenience,
--
Jaime Crespo
<http://wikimedia.org>
Dear all,
For a project, we are trying to build an automatic analytical data
extraction script similar to BaGLAMa.
The BaGLAMa tool gives information about all media in a certain category.
We cannot find out how BaGLAMa collects the filenames for all files within
a category. Does anyone know from which dump or API this can be retrieved?
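For concreteness, the kind of listing we are after could be produced with the MediaWiki API's categorymembers query, sketched below in Python with a placeholder category name; we do not know whether BaGLAMa actually works this way or reads a dump instead.

import requests

API = "https://commons.wikimedia.org/w/api.php"

def files_in_category(category):
    # Yield the titles of all files in a category, following API continuation.
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "file",
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

for name in files_in_category("Category:Example images"):
    print(name)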
Regards,
Sander Ubink