Hi all,

Who: This mostly applies to people who have access to the stat1002 and stat1003 statistics machines on the production cluster and who publish datasets as static files.

What: We are no longer using datasets.wikimedia.org to serve static datasets.  We have set up a redirect, so requests like https://datasets.wikimedia.org/$1 will be sent to https://analytics.wikimedia.org/datasets/archive/$1.  Most importantly, publishing datasets is now much easier.  Any files you put in published-datasets on either machine:

stat1002:/a/published-datasets
stat1003:/srv/published-datasets

will be merged and served together at:

https://analytics.wikimedia.org/datasets/
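If you have old datasets.wikimedia.org links in docs or scripts, the redirect described above can be sketched roughly like this.  This is only an illustration of the $1 mapping, not the actual redirect configuration; the function name and the example path are made up, and keeping the query string is an assumption on my part.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "datasets.wikimedia.org"
NEW_HOST = "analytics.wikimedia.org"
NEW_PREFIX = "/datasets/archive"


def rewrite_dataset_url(url: str) -> str:
    """Map an old datasets.wikimedia.org URL to its new archive location.

    Hypothetical helper: URLs on any other host are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname != OLD_HOST:
        return url
    # The old path plays the role of $1 in the redirect rule.
    old_path = parts.path if parts.path.startswith("/") else "/" + parts.path
    # Assumption: query string and fragment are carried over as-is,
    # and the result always uses https.
    return urlunsplit(
        ("https", NEW_HOST, NEW_PREFIX + old_path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # "some-folder/example.tsv" is a made-up path for illustration.
    print(rewrite_dataset_url("https://datasets.wikimedia.org/some-folder/example.tsv"))
```

The old links keep working through the redirect, so rewriting them is optional; it just saves a hop.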

One request as we all enjoy this much simpler process: let's use README files in these directories to let future versions of us know what the datasets are all about.  That will make the repository more fun for others to browse and ease future cleanups.  Thank you!
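If it helps, here is a small sketch for finding directories that still need a README.  It is not an official tool; the function name is made up, and it assumes a README is any file whose name starts with "README" (README, README.md, README.txt, and so on).

```python
from pathlib import Path
from typing import List


def dirs_missing_readme(root: str) -> List[str]:
    """Return root and any subdirectories that contain no README* file.

    Hypothetical helper; the README naming rule is an assumption.
    """
    root_path = Path(root)
    # Check the root itself, then every nested directory in sorted order.
    candidates = [root_path] + sorted(p for p in root_path.rglob("*") if p.is_dir())
    missing = []
    for directory in candidates:
        has_readme = any(
            child.is_file() and child.name.upper().startswith("README")
            for child in directory.iterdir()
        )
        if not has_readme:
            missing.append(str(directory))
    return missing


if __name__ == "__main__":
    # Path taken from the announcement; use /a/published-datasets on stat1002.
    for path in dirs_missing_readme("/srv/published-datasets"):
        print(path)
```

Whether every nested directory needs its own README is a judgment call; one at the top of each dataset folder is probably enough.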


TODO

If something of yours got lost, let us know; we have backups.  If you had files that we might have cleaned up, we put them in /srv/otto-to-delete-datasets-cleanup and /a/otto-to-delete-datasets-cleanup.  Take a look there and move files into published-datasets as you see fit.


Context

For a long time, publishing files from stat1002 and stat1003 was quite painful.  There were three folders (some on both boxes, some on only one), plus symlinks and rsyncs; it was bad.  We talked to everyone who had files in these folders and gathered consensus for this deprecation.  If this message catches you by surprise, please let us know which channel we should use to reach you next time and we'll add it to our communication plan.

This work is tracked in T159409.