As we start to wrap up the calendar year, here's what we've accomplished so far with Wikistats. We're really excited to have some data in our production Hive database for people to play with. We worked hard to clean up and present an intuitive interface to all of MediaWiki history. The results are captured in the tables mentioned below, which we'll cover in more depth in an upcoming tech talk. Documentation for the project is here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake.
Our goals so far and progress breakdown:
1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [ ] *Sanitize* pageview data with more dimensions for public consumption
4. [beta] Build pipeline to process and analyze *editing* data
5. [beta] Load editing data into an *API*
6. [ ] *Sanitize* editing data for public consumption
7. [ ] *Design* UI to organize dashboards built around new data
8. [ ] Build enough *dashboards* to replace the main functionality of stats.wikipedia.org
9. [ ] Officially replace stats.wikipedia.org with *(maybe) analytics.wikipedia.org*
*. [ ] Bonus: *replace dumps generation* based on the new data pipelines
4 & 5. Since our last update, we've finished the pipeline that imports data from MediaWiki databases, cleans it up as best as possible, reshapes it in an analytics-friendly way, and makes it easily queryable. I'm marking these goals as "beta" because we're still tuning the algorithm for performance and productionizing the jobs. This will be completed early next quarter, but in the meantime we have data for people to play with internally. Sadly, we haven't sanitized it yet, so we can't publish it. For those with internal access:
* https://pivot.wikimedia.org/#edit-history-test is the full history across all wikis. It's a bit hard to understand how to slice and dice, so we will host a tech talk and present it at the January metrics meeting if we can.
* In Hive, you can access this data in the wmf database. The tables are:
  - wmf.mediawiki_history: denormalized full history, with this schema: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_history
  - wmf.mediawiki_page_history: the sequence of states of each wiki page (schema: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_page_history)
  - wmf.mediawiki_user_history: the sequence of states of each user account (schema: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_user_history)
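To give a feel for querying these tables, here is a hypothetical HiveQL sketch that counts 2016 revisions per wiki from the denormalized history table. The column names (wiki_db, event_entity, event_type, event_timestamp) are assumptions based on the linked schema pages, and the schema is still in beta, so please verify them there before running anything:

```sql
-- Hypothetical example: top 10 wikis by number of revisions created in 2016.
-- Column names are taken from the schema docs linked above and may change
-- while the dataset is in beta; check those pages first.
SELECT wiki_db,
       COUNT(*) AS revisions_2016
FROM wmf.mediawiki_history
WHERE event_entity = 'revision'       -- rows describing revision events
  AND event_type = 'create'           -- only revision creations (edits)
  AND event_timestamp LIKE '2016%'    -- timestamps stored as strings
GROUP BY wiki_db
ORDER BY revisions_2016 DESC
LIMIT 10;
```

The same slice-and-dice is available interactively in Pivot for those who prefer not to write queries.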
6. Sanitizing has not moved forward, as we need DBA time and they've been overloaded. We will attempt to restart this effort in Q3.
7. We have begun the design process; we'll share more about this as we go.
Our goals and planning for next quarter have us finishing 4, 5, 7, and 8: basically, putting a UI on top of the data pipeline we have in place and updating it weekly. We also hope to make good progress on 6, but that depends on collaboration with the DBA team and is harder than we originally imagined.
And remember, voice your opinions about important reports in the current Wikistats here: https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_report (thank you so much to the many people who already chimed in).
Hi Dan,
Thanks for sharing! Can you (or somebody else) tell me where the ticket for "*7. [ ] Design UI to organize dashboards built around new data*" is? I'd be interested and I may be able to help.
Jan
2016-12-03 17:38 GMT+01:00 Dan Andreescu dandreescu@wikimedia.org:
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
The master ticket for UI design is: https://phabricator.wikimedia.org/T140000 *
* yes, I waited until phabricator hit an even number and grabbed it for this project :)
On Mon, Dec 5, 2016 at 3:07 AM, Jan Dittrich jan.dittrich@wikimedia.de wrote:
--
Jan Dittrich
UX Design / User Research

Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
http://wikimedia.de

Imagine a world in which every single human being can freely share in the sum of all knowledge. That's our commitment.

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as a charitable organization by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.