Hi Genevieve,
This is Leila from Research. Thanks for reaching out.
Access to non-public data through the Research team requires a formal research collaboration between us and your team. Whether a formal collaboration can be created depends on certain requirements being met [1] and on the Research team's capacity. At the moment, our capacity is very limited, and you have a specific research question in mind that you want to address. Unfortunately, I don't see a way for us to accommodate your request at this point.
As Dan said, we will be looking into improving our bot detection algorithms in the next 3-6 months. If you'd like to be informed about that research when we are closer to picking it up, and you're interested in collaborating with us there, please ping me off-list and we will get in touch with you in a few months.
I'm sorry that we cannot be of more help for your research at this point.
Best, Leila
[1] https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations#How_...
Leila Zia Senior Research Scientist Wikimedia Foundation
On Mon, Mar 6, 2017 at 6:28 AM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
Hi Genevieve & Jelena,
We have a process for working with external researchers, and it starts here: https://meta.wikimedia.org/wiki/Research:Access_to_non-public_data
It certainly sounds like the data we have could help you. We have some requirements listed there, and your project would need the approval of the Research team. You can also check out what other Research projects are happening: https://meta.wikimedia.org/wiki/Research:Projects
We (the Analytics engineering team) are very interested in bot detection as well. It might be useful to collaborate. We have several important use cases for which we need to distinguish bot activity from human activity, and we were planning on starting that work within one or two quarters.
On Thu, Mar 2, 2017 at 7:15 PM, Genevieve Bartlett <bartlett@isi.edu> wrote:
Hi All -
Emanuele Rocca suggested we reach out to you and see if you would be willing to share web log/content access data.
Jelena and I are network security researchers at the University of Southern California's Information Sciences Institute. We're working on a project for application-level DDoS defences, and are evaluating our defences for web applications.
Our defences model how legitimate users interact with served content, and using these models we attempt to differentiate between legitimate users and any attacking bots during high load (i.e. a potential attack). Our models are based on the timing between user requests and the semantic connections (or lack thereof) between content requests. More information on our NSF-funded project can be found here: https://steel.isi.edu/Projects/frade/
Right now, to collect data for evaluation, we've mirrored several sites (Wikipedia is one of them :) and hired ~200 users to interact with our mirrored sites (for app-level attack data, we simulate attacks). Of course, this isn't the ideal way of getting data on human-content interaction, and we would be thrilled to augment our evaluation with "real world" data.
Wikipedia is of particular interest to us given the number of "good" bots which access content and whose access patterns may not exist in our current models, which are trained on human interactions with mirrored Wikipedia content.
We would be extremely grateful for any information you are willing to share. We understand and fully support the need to preserve the privacy of Wikipedia and Wikipedia users, and we regularly work with anonymized datasets. If there's any need for NDAs or similar agreements, we are very open to whatever is necessary. In addition to web/access logs, we would also be grateful for any information on application-level DoS attacks or flash crowds you have experienced.
cheers, gen
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics