Dear Wikimedia analytics team,
We are three master's students from Vrije Universiteit Amsterdam (VU) and the University of
Amsterdam (UvA) working on a large-scale data engineering project on detecting DDoS attacks
on Wikipedia. We analyse page views and traffic and try to distinguish, for example, DDoS
attacks from trending topics.
For this project, we need a lot of data. We found two sources of public data: Pageview
complete (https://dumps.wikimedia.org/other/pageview_complete/) and the filtered version
thereof (https://dumps.wikimedia.org/other/pageviews/). While these dumps are already
quite useful, we also found a dataset with even more information
(https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_ho…). In
particular, it contains the country a pageview came from and the referer, which could both
be very useful for our project.
According to the above page, this dataset has been private since 2018. We would like
to ask whether we could get access to this dataset for our research, or to any
other extended version of the public dumps that would enable us to do more in-depth
research. We have our own cluster, so we could work on a copy of the data. Moreover, we
would like to share our project and all our results with you to help contribute to your
security measures.
Best regards,
Charel Felten, Gilles Magalhaes and Aleksander Janczewski