Hi,
Welcome to the first of a series of semi-regular updates on our progress towards Wikistats 2.0. As you may have seen from the banners on stats.wikimedia.org, we're working on a replacement for Wikistats. Erik talked about this in his announcement [1]. To summarize it from our point of view:
* Wikistats has served the community very well so far, and we're looking to keep every bit of value in the upgrade
* Wikistats depends on the dumps generation process, which is getting slower and slower due to its architecture. Because of this, most editing metrics are delayed by weeks through no fault of the Wikistats implementation
* Finding data on Wikistats is a bit hard for new users, so we're working on new ways to organize what's available and present it in a comprehensive way along with other data sources like dumps
This regular update is meant to keep interested people informed on the direction and progress of the project.
Of course, Wikistats 2.0 is not a new project. We've already replaced the data pipeline behind the pageview reports on stats.wikimedia.org. But the end goal is a new data pipeline for editing, reading, and beyond, plus a nice UI to help guide people to what they need. Since this is the first update, I'll lay out the high-level milestones along with where we are, and then I'll give detail about the last few weeks of work.
1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [ ] *Sanitize* pageview data with more dimensions for public consumption
4. [ ] Build pipeline to process and analyze *editing* data
5. [ ] Load editing data into an *API*
6. [ ] *Sanitize* editing data for public consumption
7. [ ] *Design* UI to organize dashboards built around new data
8. [ ] Build enough *dashboards* to replace the main functionality of stats.wikimedia.org
9. [ ] Officially replace stats.wikimedia.org with *(maybe) analytics.wikipedia.org*
*. [ ] Bonus: *replace dumps generation* based on the new data pipelines
Our focus last year was pageview data, and that's how we got 1 and 2 done. 3 is mostly done except deploying the logic and making the data available. So 4, 5, and 6 are what we're working on now. As we work on these pieces, we'll take vertical slices of different important metrics and take them from the data processing all the way to the dashboards that present the results. That means we'll make incremental progress on 8 and 9 as we go. But we won't be able to finish 7 and 9 until we have a cohesive design to wrap around it all. We don't want to introduce yet more dashboard hell; we want to save you, the consumers, from all that.
So the focus right now is on the editing data pipeline. What do I mean by this? Data is already available in Quarry and via the API. That's true, but here are some problems with that data:
* Lack of historical change information. For example, we only have pageview data by the title of the page. If we want all the pageviews for a page that's now called C, but was called B two months ago and A three months before that, we have to manually parse PHP-serialized parameters in the logging table to trace back those page moves
* No easy way to look at data across wikis. If someone asks you to run a Quarry query to look at data from all Wikipedias, you have to run hundreds of separate queries, one for each database
* No easy way to look at a lot of data. Quarry and other tools time out after a certain amount of time to protect themselves. Downloading dumps is a way to get access to more data, but the files are huge and analysis is hard
* Querying the API with complex multi-dimensional analytics questions isn't possible
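To make the page-rename problem concrete, here's a toy sketch in Python. It uses a made-up in-memory move log rather than the real logging table (whose move records are buried in PHP-serialized parameters), but it shows the idea of walking the moves backwards from a page's current title:

```python
from datetime import datetime

# Hypothetical, simplified move log: (timestamp, old_title, new_title).
# In MediaWiki, these records live in the logging table.
moves = [
    (datetime(2016, 2, 1), "A", "B"),
    (datetime(2016, 5, 1), "B", "C"),
]

def title_history(current_title, moves):
    """Walk the move log backwards from the current title to
    recover every name the page has had, newest first."""
    history = [current_title]
    title = current_title
    for ts, old, new in sorted(moves, key=lambda m: m[0], reverse=True):
        if new == title:
            history.append(old)
            title = old
    return history

print(title_history("C", moves))  # ['C', 'B', 'A']
```

With that history in hand, pageview counts recorded under "A", "B", and "C" can all be attributed to the same page.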
These are the kinds of problems we're trying to solve. Our progress so far:
* Retraced history through the logging table to piece together what names each page has had throughout its life. Deleted pages were included in this reconstruction
* Found what names each user has had throughout their life, and what rights and blocks were applied to or removed from users
* Wrote event schemas for Event Bus, which will feed data into this pipeline in near real time (so metrics and dashboards can be updated in near-real-time)
* Came up with a single denormalized schema that holds every kind of event possible in the editing world. This is a join of the Event Bus schemas mentioned above and can be fed either in batch from our reconstruction algorithm or in real time. If you're familiar with lambda architecture, this is the approach we're taking to make our editing data available
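As a rough illustration only (the real schema is richer and is defined by the Event Bus schemas mentioned above), a single denormalized editing event might look something like this: one row type with nullable fields, so page, user, and revision events all fit the same table whether they arrive in batch or in real time. All field names here are invented for the sketch:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, much-simplified denormalized editing event.
# Fields that don't apply to a given event kind stay None, so
# batch reconstruction and real-time Event Bus feeds can both
# write rows of this one shape.
@dataclass
class EditingEvent:
    wiki: str                        # e.g. "enwiki"
    event_entity: str                # "page" | "user" | "revision"
    event_type: str                  # "create" | "move" | "rename" | ...
    event_timestamp: str
    page_id: Optional[int] = None
    page_title: Optional[str] = None
    user_id: Optional[int] = None
    user_text: Optional[str] = None
    revision_id: Optional[int] = None

# A page-move event and a user-rename event share the same row type:
move = EditingEvent(wiki="enwiki", event_entity="page", event_type="move",
                    event_timestamp="2016-05-01T00:00:00Z",
                    page_id=42, page_title="C")
rename = EditingEvent(wiki="enwiki", event_entity="user", event_type="rename",
                      event_timestamp="2016-05-02T00:00:00Z",
                      user_id=7, user_text="NewName")
```

Because every event kind shares one schema, cross-entity analytics questions become filters over a single table instead of joins across many.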
Right now we're testing the accuracy of our reconstruction against Wikistats data. If this works, we'll open up the schema to more people to play with so they can give feedback on this way of doing analytics. And if all that looks good, we'll load the data into Druid and Hive and run the highest-priority metrics on this new platform. We hope to be done with this by the end of this quarter. To weigh in on which reports are important, make sure you visit Erik's page [2]. We'll also do a tech talk on our algorithm for historical reconstruction and lessons learned on MediaWiki analytics.
If you're still reading, congratulations, sorry for the wall of text. I look forward to keeping you all in the loop, and to making steady progress on this project that's very dear to our hearts. Feel free to ask questions and if you'd like to be involved, just let me know how. Have a nice weekend :)
[1] http://infodisiac.com/blog/2016/05/wikistats-days-will-be-over-soon-long-liv... [2] https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_re...
On July 30, 2016 at 06:22, "Dan Andreescu" dandreescu@wikimedia.org wrote:
Hi,
Welcome to the first of a series of semi-regular updates on our progress towards Wikistats 2.0.
Much appreciated. Updates about this are very interesting.
* Finding data on Wikistats is a bit hard for new users, so we're working on new ways to organize what's available and present it in a comprehensive way along with other data sources like dumps
I should mention that there are quite a lot of things in Wikistats that are NOT hard to find :)
And I hope it will remain that way. Basic metrics like active and very active users, and data for a language in relation to the number of its speakers, are very straightforward and should remain that way.
- [ ] *Sanitize* pageview data with more dimensions for public consumption
6. [ ] *Sanitize* editing data for public consumption
This reminds me: Is there some kind of an open policy document about what is supposed to be sanitized? The general idea is "user's private information", but I'd love details and examples, especially non-trivial ones. For example, I sometimes hear that grand total numbers are usually OK to publish, but some wikis are so small that even the bare numbers may make it possible to guess some private information. It would be lovely to have a written policy about this.
- [ ] Officially replace stats.wikipedia.org with *(maybe) analytics.wikipedia.org*
But please don't break existing links :)
- no easy way to look at data across wikis. If someone asks you to run a Quarry query to look at data from all Wikipedias, you have to run hundreds of separate queries, one for each database
https://phabricator.wikimedia.org/T95582 :)
If you're still reading, congratulations, sorry for the wall of text.
No problem at all, very useful!
Thanks Dan, very helpful update.
Erik
This reminds me: Is there some kind of an open policy document about what is supposed to be sanitized? The general idea is "user's private information", but I'd love details and examples, especially non-trivial ones.
There is quite a bit of documentation about sanitization, and I'm including some links below. FYI, we will not be working on this area in the near term, as our efforts are concentrated on scaling the pageview API and edit history reconstruction. Please see:
https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/Identity_... https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/Sanitizat...
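The links above describe the actual policy. Purely as an illustration of the small-wiki concern Amir raises, a threshold-based suppression rule might look like the sketch below; the function and the threshold value are invented for the example, not taken from the policy:

```python
K = 100  # hypothetical anonymity threshold, not a real policy value

def sanitize_count(n, k=K):
    """Withhold counts small enough that publishing them could
    help identify individual users on low-traffic wikis; larger
    counts pass through unchanged."""
    return n if n >= k else None  # None = "withheld"

print(sanitize_count(25000))  # 25000
print(sanitize_count(7))      # None
```

Real sanitization also has to consider combinations of dimensions (a large total can still hide a small, identifying slice), which is why the documented approach is more involved than a single cutoff.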
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics