I’m very excited to share some updates from ops on analytics-store.eqiad.wmnet [1] aka “the one box to rule them all”.
This box (which you access with the “research” SQL credentials) gives you:
1) read access to replicas of all production DBs consolidated on a single machine
2) read access to all EventLogging data via the log DB
3) read/write access to a shared staging DB that can be used as scratch space for temporary tables (similar to the staging DB on s1-analytics). If you create tables on staging, please prefix them with your shell user id (e.g. dartar_foo).
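The naming rule for staging tables can be sketched in a few lines of Python. This is only an illustration: the helper names `staging_table_name` and `create_scratch_table_sql` are hypothetical, and the sketch assumes the shared database is literally called `staging` and that table names are limited to letters, digits, and underscores.

```python
import re


def staging_table_name(shell_user: str, table: str) -> str:
    """Return a table name prefixed with the owner's shell user id."""
    name = f"{shell_user}_{table}"
    # Allow only letters, digits, and underscores so the name can be
    # embedded in a SQL statement without quoting (an assumption of this sketch).
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError(f"unsafe table name: {name!r}")
    return name


def create_scratch_table_sql(shell_user: str, table: str, columns_sql: str) -> str:
    """Build a CREATE TABLE statement for a scratch table on staging.

    Assumes the shared database is named `staging`; `columns_sql` is the
    column definition list, passed through unchanged.
    """
    return f"CREATE TABLE staging.{staging_table_name(shell_user, table)} ({columns_sql})"


if __name__ == "__main__":
    print(create_scratch_table_sql("dartar", "foo", "page_id INT"))
```

The resulting statement could then be run with any MySQL client using the “research” credentials.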
This is some of the best news I have had from ops since I joined WMF, and it will make my work much easier. Thanks to Sean and everyone else who helped make this happen.
Ops is also working on a solution to consolidate all credentials for analytics databases in a single place, via the creation of a “researcher” user group [2]. I’ll send a note to the list when this is completed.
Dario
[1] a CNAME for dbstore1002.eqiad.wmnet
[2] https://gerrit.wikimedia.org/r/#/c/136273/
Hi Dario,
Dario Taraborelli wrote on 30-5-2014 19:35:
- read/write access to a shared *staging DB* that can be used as
scratch space for temporary tables (similar to the staging DB on s1-analytics). If you create tables on staging, please prefix them with your shell user id (e.g. dartar_foo).
You might want to start using the toolserver/toollabs convention that if a database name ends in _p, it can be viewed by anyone. That way you can mark databases that don't contain private information and might be opened up to more people in the future.
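As a rough illustration of the _p convention, the check amounts to a suffix test on the database name. The function names and the example database names below are hypothetical, not part of any existing tool.

```python
def is_public_database(db_name: str) -> bool:
    """Return True if the name follows the _p (publicly viewable) convention."""
    return db_name.endswith("_p")


def split_by_visibility(db_names):
    """Partition database names into (public, private) lists, keeping order."""
    public = [name for name in db_names if is_public_database(name)]
    private = [name for name in db_names if not is_public_database(name)]
    return public, private


if __name__ == "__main__":
    # Hypothetical database names, used only to show the partition.
    print(split_by_visibility(["dartar_edits_p", "dartar_scratch"]))
```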
Maarten
Hi Maarten
In fact, on s1-analytics we have two separate databases:
• “staging” is a sandbox for researchers to store all kinds of temporary datasets, many of which are not meant to be permanently retained or documented
• “prod” is meant to host well-documented datasets that do not contain private information and are kosher for publication
We have several projects in the pipeline to generate datasets of analytics interest that we would like to expose to labs. These include:
• a master dataset of total monthly contributions by user by namespace by project https://trello.com/c/3ecjp9aM/237-master-monthly-editor-activity-data
• a curated dataset of historical user registration times https://trello.com/c/NB1WO9fM/315-historical-user-registration-data
• a dataset with revert metadata https://trello.com/c/FZd4UIcR/29-revert-tracking-and-revert-dump-generation
We also have specs for new server-side logs that will cleanly track page creations, page moves, and page deletions: https://trello.com/c/aKzWq1e3/259-create-schemas-for-page-creation-moves-and...
Finally, we’re discussing how to expose to labs the existing EventLogging schemas whose data is public and should be made publicly available. I don’t have a definite ETA for each of these projects, but I’ll make sure we post announcements on the lists as soon as new data becomes available.
Dario