Hi!  And thanks for the question.  The pageview hourly dataset includes sensitive data and our policy does not allow moving it outside servers we manage.  To work with it, you would have to apply for a formal collaboration via 
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations

I do wonder if for cases like this we could establish some kind of lighter-weight process whereby you practice on some sample data and then submit a proposal for a data dump for public review.  Once it's reviewed by enough people, which could take a while, we could in theory just run the code and publish the data somewhere.  I'll talk to my team about this later today and write back here.  This would only work if the results of the query preserved the anonymity of our users, but I think DDoS research should probably fall in that category.

On Thu, Sep 30, 2021 at 05:30 Charel Felten via Analytics <analytics@lists.wikimedia.org> wrote:
Dear Wikimedia analytics team,

We are three master's students from Vrije Universiteit Amsterdam (VU) and University of Amsterdam (UVA) working on a large-scale data engineering project about detecting DDoS attacks on Wikipedia. We analyse page views and traffic and try to distinguish, for example, DDoS attacks from trending topics.

For this project, we need a lot of data. We found two sources of public data, Pageview complete (https://dumps.wikimedia.org/other/pageview_complete/) and the filtered version thereof (https://dumps.wikimedia.org/other/pageviews/). While these dumps are already quite useful, we also found that there is a dataset with even more information (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly). In particular, it contains the country a pageview came from and the referer, both of which could be very useful for our project.

According to the above page, this dataset has been private since 2018. We would like to ask whether it is possible to get access to this dataset, or to any other extended version of the public dump, so that we can do more in-depth research. We have our own cluster, so we could work on a copy of the data. Moreover, we would like to share our project and all our results with you to help contribute to your security measures.

Best regards,
Charel Felten, Gilles Magalhaes and Aleksander Janczewski