Analytics July 2016

analytics@lists.wikimedia.org

29 participants
25 discussions

Beeline as Hive client
by Madhumitha Viswanathan 03 Oct '18

Hi all,

For all Hive users on stat1002/1004: you might have seen a deprecation warning when you launch the hive client, saying that it's being replaced with Beeline. The Beeline shell has always been available to use, but it required supplying a database connection string every time, which was pretty annoying. We now have a wrapper script <https://github.com/wikimedia/operations-puppet/blob/production/modules/role…> set up to make this easier. The old Hive CLI will continue to exist, but we encourage moving over to Beeline.

You can use it by logging into the stat1002/1004 boxes as usual, and launching `beeline`. There is some documentation on this here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.

If you run into any issues using this interface, please ping us on the Analytics list or #wikimedia-analytics, or file a bug on Phabricator <http://phabricator.wikimedia.org/tag/analytics>.

(If you are wondering stat1004 whaaat - there should be an announcement coming up about it soon!)

Best,
--Madhu :)
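
For readers curious what "supplying a database connection string every time" amounts to, here is a minimal sketch of the kind of command a wrapper could assemble. It is not the actual puppet-managed script: the function name, host, port, database and user below are all hypothetical placeholders. Only the `-u` (JDBC URL) and `-n` (user name) Beeline options are taken from Beeline itself.

```python
import shlex


def beeline_command(host, port=10000, database="default", user=None):
    """Build the argument list for a Beeline invocation.

    host, port, database and user are illustrative assumptions; the real
    wrapper on the stat boxes may use different values and extra options.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    args = ["beeline", "-u", url]
    if user:
        args += ["-n", user]
    return args


# Hypothetical host name, for illustration only.
cmd = beeline_command("hive.example.org", user="alice")
print(shlex.join(cmd))
```

With a wrapper like this on the PATH, users only type `beeline` and never see the connection string.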

Wikipedia aggregate clickstream data released
by Dario Taraborelli 17 Jan '18

We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:

http://dx.doi.org/10.6084/m9.figshare.1305770

This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.

This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia

We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream

Ellery and Dario
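
As an illustration of the last use case, the sketch below turns (referer, article, count) rows into first-order Markov transition probabilities. The three-column tuple layout and the sample rows are assumptions made for the example, not the dataset's actual file format.

```python
from collections import defaultdict


def transition_probabilities(pairs):
    """Estimate P(article | referer) from (referer, article, count) rows."""
    # Total outgoing clicks per referer.
    totals = defaultdict(int)
    for referer, _article, count in pairs:
        totals[referer] += count

    # Normalise each pair's count by its referer's total.
    chain = defaultdict(dict)
    for referer, article, count in pairs:
        share = count / totals[referer]
        chain[referer][article] = chain[referer].get(article, 0.0) + share
    return dict(chain)


# Made-up rows, for illustration only.
sample = [("A", "B", 30), ("A", "C", 10), ("B", "C", 5)]
chain = transition_probabilities(sample)
print(chain["A"])
```

Each row of the resulting mapping sums to 1, so it can be sampled from directly to simulate a reader's path.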

Request stream data set for cache tuning
by Daniel Berger 31 Aug '16

Hi everyone,

I'm a PhD student studying mathematical models to improve the hit ratio of web caches. In my research community, we are lacking realistic data sets and frequently rely on outdated modelling assumptions.

Previously (~2007), a trace containing 10% of user requests issued to Wikipedia was publicly released [1]. This data set has been used widely for performance evaluations of new caching algorithms, e.g., for the new Caffeine caching framework for Java [2].

I would like to ask for your comments about compiling a similar (updated) data set and making it public. In my understanding, the necessary logs are readily available, e.g., in the Analytics/Data/Mobile requests stream [3] on stat1002, with a sampling rate of 1:100. As this request stream contains sensitive data (e.g., client IPs), it would need anonymization before making it public. I would be glad to help with that.

The previously released data set [1] contains no client information. It contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update flag. I would additionally suggest including 5) the cache's hostname, 6) the cache_status, and 7) the response size (from the Wikimedia cache log format). I believe this format would preserve anonymity, and would be interesting for many researchers.

Let me know your thoughts.

Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger

[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
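
The proposed seven-field format can be sketched as a simple projection that keeps only the listed fields and drops everything else, including client information. The dictionary keys below are hypothetical names chosen for the example; the real cache log field names may differ.

```python
# Fields 2-7 of the proposed format; field 1 (the counter) is generated.
# These key names are assumptions, not the actual log schema.
PUBLIC_FIELDS = ("timestamp", "url", "update", "hostname",
                 "cache_status", "response_size")


def anonymize(records):
    """Yield (counter, timestamp, url, update, hostname, cache_status, size).

    Any key not in PUBLIC_FIELDS (e.g. a client IP) is never copied.
    """
    for counter, record in enumerate(records, start=1):
        yield (counter,) + tuple(record[field] for field in PUBLIC_FIELDS)


# Made-up request record, for illustration only.
raw = [{
    "ip": "203.0.113.7",
    "timestamp": 1470000000,
    "url": "/wiki/Cache",
    "update": False,
    "hostname": "cp0000",
    "cache_status": "hit",
    "response_size": 51234,
}]
trace = list(anonymize(raw))
print(trace[0])
```

Whether URLs alone can leak sensitive information is a separate question that a projection like this does not address.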

Pagecount Datasets to be Deprecated at the end of May
by Dan Andreescu 08 Aug '16

Just a reminder, we will be deprecating the pagecounts datasets at the end of May, as we mentioned earlier this year [0]. This means these files will remain there to be used by researchers but new files will not be generated in the future.

*Pagecounts datasets that will be deprecated*
  pagecounts-raw
  pagecounts-all-sites

Options for switching to the new datasets [1]:
  pageviews for the same format but better quality data
  pagecounts-ez for compressed data

[0] https://lists.wikimedia.org/pipermail/analytics/2016-March/005060.html
[1] https://dumps.wikimedia.org/other/analytics/

Full Pageviews API and Unique Devices API Support Added to Wikipedia Tools for Google Spreadsheets Add-on
by Thomas Steiner 01 Aug '16

Dear Analytics subscribers,

A quick (re-)plug for my Google Spreadsheets add-on Wikipedia Tools [0] that, as of today, has gained the full expressive power of the amazing(!) Pageviews API [1] and Unique Devices API [2].

The GitHub Issue comment has more info [3], but as a teaser: a literally three-formula evergreen dashboard that always shows yesterday's most-viewed Wikipedia pages broken down by Desktop, Mobile App, and Mobile Web.

Hope this is useful to some on the list! Feedback appreciated…

Cheers,
Tom

--
[0] https://chrome.google.com/webstore/detail/wikipedia-tools/aiilcelhmpllcgkhh…
[1] https://wikimedia.org/api/rest_v1/?doc#/Pageviews_data
[2] https://wikimedia.org/api/rest_v1/?doc#/Unique_devices_data
[3] https://github.com/tomayac/wikipedia-tools-for-google-spreadsheets/issues/6…
[4] https://docs.google.com/spreadsheets/d/1e5jppJq59zhYzCw7wle1gj3fo7HjLCGwvFd…

--
Dr. Thomas Steiner, Employee (http://blog.tomayac.com, https://twitter.com/tomayac)
Google Germany GmbH, ABC-Str. 19, 20354 Hamburg, Germany
Managing Directors: Matthew Scott Sucherman, Paul Terence Manicle
Registration office and registration number: Hamburg, HRB 86891

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.29 (GNU/Linux)
iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DCharaCTersAttH3b0ttom
hTtPs://xKcd.cOm/1181/
-----END PGP SIGNATURE-----
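
For anyone who wants the same breakdown outside a spreadsheet, the sketch below builds the Pageviews API "top" request URLs for yesterday, one per access method. It only constructs URLs and makes no request. It is not how the add-on itself is implemented, and the helper name `top_pages_urls` is made up for this example.

```python
from datetime import date, timedelta

# "top" endpoint of the Pageviews API:
# .../top/{project}/{access}/{year}/{month}/{day}
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top"
ACCESS_METHODS = ("desktop", "mobile-app", "mobile-web")


def top_pages_urls(project, day):
    """Return one most-viewed-pages URL per access method for the given day."""
    return {
        access: f"{BASE}/{project}/{access}/{day:%Y/%m/%d}"
        for access in ACCESS_METHODS
    }


yesterday = date.today() - timedelta(days=1)
for access, url in top_pages_urls("en.wikipedia", yesterday).items():
    print(access, url)
```

Because the date is computed at run time, the three URLs stay "evergreen" in the same sense as the dashboard described above.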

[Wikistats 2.0] [Regular Update] First update on Wikistats 2.0
by Dan Andreescu 01 Aug '16

Hi,

Welcome to the first of a series of semi-regular updates on our progress towards Wikistats 2.0. As you may have seen from the banners on stats.wikimedia.org, we're working on a replacement for Wikistats. Erik talked about this in his announcement [1]. To summarize it from our point of view:

* Wikistats has served the community very well so far, and we're looking to keep every bit of value in the upgrade
* Wikistats depends on the dumps generation process which is getting slower and slower due to its architecture. Because of this, most editing metrics are delayed by weeks through no fault of the Wikistats implementation
* Finding data on Wikistats is a bit hard for new users, so we're working on new ways to organize what's available and present it in a comprehensive way along with other data sources like dumps

This regular update is meant to keep interested people informed on the direction and progress of the project.

Of course, Wikistats 2.0 is not a new project. We've already replaced the data pipeline behind the pageview reports on stats.wikimedia.org. But the end goal is a new data pipeline for editing, reading, and beyond, plus a nice UI to help guide people to what they need. Since this is the first update, I'll lay out the high level milestones along with where we are, and then I'll give detail about the last few weeks of work.

1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [ ] *Sanitize* pageview data with more dimensions for public consumption
4. [ ] Build pipeline to process and analyze *editing* data
5. [ ] Load editing data into an *API*
6. [ ] *Sanitize* editing data for public consumption
7. [ ] *Design* UI to organize dashboards built around new data
8. [ ] Build enough *dashboards* to replace the main functionality of stats.wikipedia.org
9. [ ] Officially replace stats.wikipedia.org with *(maybe) analytics.wikipedia.org <http://analytics.wikipedia.org>*
***. [ ] Bonus: *replace dumps generation* based on the new data pipelines

Our focus last year was pageview data, and that's how we got 1 and 2 done. 3 is mostly done except deploying the logic and making the data available. So 4, 5, and 6 are what we're working on now. As we work on these pieces, we'll take vertical slices of different important metrics and take them from the data processing all the way to the dashboards that present the results. That means we'll make incremental progress on 8 and 9 as we go. But we won't be able to finish 7 and 9 until we have a cohesive design to wrap around it all. We don't want to introduce yet more dashboard hell; we want to save you, the consumers, from all that.

So the focus right now is on the editing data pipeline. What do I mean by this? Data is already available in quarry and via the API. That's true, but here are some problems with that data:

* lack of historical change information. For example, we only have pageview data by the title of the page. If we wanted to get all the pageviews for a page that's now called C, but was called B two months ago and A three months before that, we have to manually parse PHP-serialized parameters in the logging table to trace back those page moves
* no easy way to look at data across wikis. If someone asks you to run a quarry query to look at data from all wikipedias, you have to run hundreds of separate queries, one for each database
* no easy way to look at a lot of data. Quarry and other tools time out after a certain amount of time to protect themselves. Downloading dumps is a way to get access to more data but the files are huge and analysis is hard
* querying the API with complex multi-dimensional analytics questions isn't possible

These are the kinds of problems we're trying to solve. Our progress so far:

* Retraced history through the logging table to piece together what names each page has had throughout its life. Deleted pages were included in this reconstruction
* Found what names each user has had throughout their life, and what rights and blocks were applied to or removed from users
* Wrote event schemas for Event Bus, which will feed data into this pipeline in near real time (so metrics and dashboards can be updated in near real time)
* Came up with a single denormalized schema that holds every single kind of event possible in the editing world. This is a join of the Event Bus schemas mentioned above and is possible to feed either in batch from our reconstruction algorithm or in real time. If you're familiar with lambda architecture, this is the approach we're taking to make our editing data available

Right now we're testing the accuracy of our reconstruction against Wikistats data. If this works, we'll open up the schema to more people to play with so they can give feedback on this way of doing analytics. And if all that looks good, we'll be loading the data into Druid and Hive and running the most high priority metrics on this new platform. We hope to be done with this by the end of this quarter. To weigh in on what reports are important, make sure you visit Erik's page [2]. We'll also do a tech talk on our algorithm for historical reconstruction and lessons learned on mediawiki analytics.

If you're still reading, congratulations, sorry for the wall of text. I look forward to keeping you all in the loop, and to making steady progress on this project that's very dear to our hearts. Feel free to ask questions and if you'd like to be involved, just let me know how.

Have a nice weekend :)

[1] http://infodisiac.com/blog/2016/05/wikistats-days-will-be-over-soon-long-li…
[2] https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_r…
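
To make the A, B, C rename example concrete, here is a toy sketch of walking page moves backwards and summing pageviews over every title a page has had. It is not the team's reconstruction algorithm: the `(old_title, new_title)` move list and the per-title view counts are made-up inputs, and it ignores timestamps, so it would over-count if another page held one of the titles at a different time.

```python
def title_history(current_title, moves):
    """Return the titles a page has had, newest first.

    moves is a chronological list of (old_title, new_title) pairs,
    a simplified stand-in for rows parsed from the logging table.
    """
    titles = [current_title]
    for old, new in reversed(moves):
        # Follow the chain backwards from the most recent known title.
        if new == titles[-1]:
            titles.append(old)
    return titles


def total_views(current_title, moves, views_by_title):
    """Sum pageviews over every title in the page's history."""
    return sum(views_by_title.get(title, 0)
               for title in title_history(current_title, moves))


# Made-up data: the page was A, then B, and is now C.
moves = [("A", "B"), ("B", "C")]
views = {"A": 10, "B": 20, "C": 30, "D": 5}
print(title_history("C", moves), total_views("C", moves, views))
```

A production version would need the move timestamps so that views are only attributed to a title for the period in which the page actually held it.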

Q4-2016 (April-June) quarterly report for Wikimedia Research
by Dario Taraborelli 30 Jul '16

This is what we've been up to at Wikimedia Research this past quarter (April - June 2016):

- Research and Data <https://commons.wikimedia.org/w/index.php?title=File:Technology_Quarterly_R…>
- Design Research <https://commons.wikimedia.org/w/index.php?title=File%3ATechnology_Quarterly…>

You might also be interested in the Analytics Engineering <https://commons.wikimedia.org/w/index.php?title=File:Technology_Quarterly_R…> team's quarterly report.

Best,
Dario

*Dario Taraborelli*
Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter <http://twitter.com/readermeter>

[Pageview API] Data Retention Question
by Dan Andreescu 30 Jul '16

Dear Pageview API consumers,

We would like to plan storage capacity for our pageview API cluster. Right now, with a reliable RAID setup, we can keep *18 months* of data. If you'd like to query further back than that, you can download dump files (which we'll make easier to use with python utilities).

What do you think? Will you need more than 18 months of data? If so, we need to add more nodes when we get to that point, and that costs money, so we want to check if there is a real need for it.

Another option is to start degrading the resolution for older data (only keep weekly or monthly for data older than 1 year for example). If you need more than 18 months, we'd love to hear your use case and something in the form of:

need daily resolution for 1 year
need weekly resolution for 2 years
need monthly resolution for 3 years

Thank you!

Dan
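
One possible reading of the "degrading resolution" option is a tiered lookup by data age, sketched below. The tier boundaries mirror the example form in the message (daily up to 1 year, weekly up to 2, monthly up to 3); treating them as age cut-offs, and counting a year as 365 days, are assumptions made for the illustration.

```python
from datetime import date

# (maximum age in days, resolution kept) - an assumed policy, not a decision.
TIERS = (
    (365, "daily"),
    (2 * 365, "weekly"),
    (3 * 365, "monthly"),
)


def resolution_for(day, today):
    """Return the finest resolution retained for data from `day`, or None."""
    age_days = (today - day).days
    for max_age, resolution in TIERS:
        if age_days <= max_age:
            return resolution
    return None  # older than every tier: only available from dump files


today = date(2016, 7, 29)
for day in (date(2016, 1, 1), date(2015, 1, 1), date(2014, 1, 1), date(2012, 1, 1)):
    print(day, resolution_for(day, today))
```

Stating a use case as a table like `TIERS` makes it easy to compare requests from different consumers.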

EventLogging new auto-purging strategies are about to be activated
by Marcel Ruiz Forns 29 Jul '16

Hi EventLogging schema owners, (cc-ing Analytics-l)

EventLogging's new auto-purging mechanism is about to be productionized [1]. This implies that from now on:

- EventLogging data will be purged according to the specific purging strategy of each schema. Those strategies were agreed with you all either in last year's audit (older schemas), or this year's re-audit (schemas created since then), and can be found in the schema talk pages.
- The default purging strategy for new schemas will be: full auto-purge after 90 days. If you want to keep your data for more than that, please contact the Analytics team and we'll analyze your schema and discuss a purging strategy that fits your needs.
- If you modify an existing schema, fields that were kept indefinitely in the previous revision will continue to be kept. However, if you add new fields to the schema, those will be auto-purged after 90 days by default. If you want to keep them for more than that, please contact the Analytics team, we'll discuss that and mark them to be kept.

Please, feel free to ask any questions about this process.

Thanks!

[1] https://phabricator.wikimedia.org/T108850

--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
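
The rules above can be sketched as a small function: events younger than 90 days are left alone, older events keep only the fields marked to be kept, and with no kept fields the event is purged entirely. This is an illustration of the policy as described, not the production mechanism tracked in [1]; the event layout and field names are hypothetical.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)


def purge(events, keep_fields, now):
    """Apply a 90-day purging strategy to a list of event dicts.

    keep_fields is the set of field names agreed to be kept indefinitely;
    an empty set means full auto-purge (the default for new schemas).
    """
    result = []
    for event in events:
        if now - event["timestamp"] <= RETENTION:
            result.append(event)  # still inside the retention window
        elif keep_fields:
            # Past the window: keep only the explicitly retained fields.
            result.append({k: v for k, v in event.items() if k in keep_fields})
        # Past the window with nothing to keep: drop the event.
    return result


# Made-up events, for illustration only.
now = datetime(2016, 7, 29)
old = {"timestamp": datetime(2016, 1, 1), "wiki": "enwiki", "userAgent": "x"}
recent = {"timestamp": datetime(2016, 7, 1), "wiki": "dewiki", "userAgent": "y"}
print(purge([old, recent], {"timestamp", "wiki"}, now))
```

Under this sketch, a field added in a later schema revision is purged after 90 days unless it is added to `keep_fields`, matching the third rule.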

Re: [Analytics] [WikimediaMobile] Fwd: [Pageview API] Data Retention Question
by Corey Floyd 29 Jul '16

For the iOS app I can say that 18 months is more than enough for our current feature set and upcoming plans. Even if we began displaying graphs of page views over time… I can’t see any need to go back more than a few weeks or months.

For historical data the idea of degrading is an interesting one. I think that the daily data becomes much less important as you go back in time. Even if we only kept daily data for 6 months, that would be enough for our use cases.

This is probably true for Android as well, since we have pretty similar UI, but I’ll let them chime in to be sure.

Let me know if you want to know any further info.

On Fri, Jul 29, 2016 at 10:31 AM, Adam Baso <abaso(a)wikimedia.org> wrote:

> Cross posting.
>
> ---------- Forwarded message ----------
> From: *Dan Andreescu* <dandreescu(a)wikimedia.org>
> Date: Friday, July 29, 2016
> Subject: [Analytics] [Pageview API] Data Retention Question
> To: Analytics List <analytics(a)lists.wikimedia.org>
>
> Dear Pageview API consumers,
>
> We would like to plan storage capacity for our pageview API cluster.
> Right now, with a reliable RAID setup, we can keep *18 months* of data.
> If you'd like to query further back than that, you can download dump files
> (which we'll make easier to use with python utilities).
>
> What do you think? Will you need more than 18 months of data? If so, we
> need to add more nodes when we get to that point, and that costs money, so
> we want to check if there is a real need for it.
>
> Another option is to start degrading the resolution for older data (only
> keep weekly or monthly for data older than 1 year for example). If you
> need more than 18 months, we'd love to hear your use case and something in
> the form of:
>
> need daily resolution for 1 year
> need weekly resolution for 2 years
> need monthly resolution for 3 years
>
> Thank you!
>
> Dan
>
> _______________________________________________
> Mobile-l mailing list
> Mobile-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/mobile-l

--
Corey Floyd
Software Engineer
Reading / iOS
Wikimedia Foundation
