Hi Genevieve & Jelena,

We have a process for working with external researchers, and it starts here: https://meta.wikimedia.org/wiki/Research:Access_to_non-public_data

It certainly sounds like the data we have could help you. The requirements are listed on that page, and your project will need the approval of the Research team. You can also check out what other Research projects are happening: https://meta.wikimedia.org/wiki/Research:Projects

We (the Analytics engineering team) are very interested in bot detection as well. It might be useful to collaborate. We have several important use cases for which we need to distinguish bot activity from human activity, and we were planning on starting that work within one or two quarters.

On Thu, Mar 2, 2017 at 7:15 PM, Genevieve Bartlett <bartlett@isi.edu> wrote:

Hi All -

Emanuele Rocca suggested we reach out to you to see if you would be willing to share web log/content access data.

Jelena and I are network security researchers at the University of Southern California's Information Sciences Institute. We're working on a project for application-level DDoS defences, and are evaluating our defences for web applications.

Our defences model how legitimate users interact with served content, and using these models we attempt to differentiate between legitimate users and any attacking bots during high load (i.e. a potential attack). Our models are based on the timing between user requests and the semantic connections (or lack thereof) between content requests. More information on our NSF-funded project can be found here: https://steel.isi.edu/Projects/frade/

Right now, to collect data for evaluation, we've mirrored several sites (Wikipedia is one of them :) and hired ~200 users to interact with our mirrored sites (for app-level attack data, we simulate attacks). Of course, this isn't the ideal way of getting data on human-content interaction, and we would be thrilled to augment our evaluation with "real world" data.

Wikipedia is of particular interest to us given the number of "good" bots that access content and whose access patterns may not exist in our current models, which are trained on human interactions with mirrored Wikipedia content.

We would be extremely grateful for any information you are willing to share. We understand and fully support the need to preserve the privacy of Wikipedia and its users, and we regularly work with anonymized datasets. If there's any need for NDAs or similar agreements, we are very open to whatever is necessary. In addition to web/access logs, we would also be grateful for any information on application-level DoS attacks or flash crowds you have experienced.

cheers,
gen

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics