This is Leila from Research. Thanks for reaching out.
Access to non-public data through the Research team happens when we set up
a formal research collaboration with you and your team. Whether a formal
collaboration can be set up depends on certain requirements being met and
on the capacity of the Research team. At the moment, our capacity is very
limited, and you already have a specific research question in mind that you
want to address. Unfortunately, I don't see a way for us to accommodate
your request at this point.
As Dan said, we will be looking into improving our algorithms for bot
detection in the next 3-6 months. If you'd like to be informed about that
research when we are closer to picking it up, and you're interested in
collaborating with us there, please ping me off-list and we will get in
touch with you in a few months.
I'm sorry that we cannot be of more help for your research at this point.
Senior Research Scientist
On Mon, Mar 6, 2017 at 6:28 AM, Dan Andreescu <dandreescu(a)wikimedia.org>
Hi Genevieve & Jelena,
We have a process for working with external researchers, and it starts
It certainly sounds like the data we have could help you. We have some
requirements listed there and your project should get the approval of the
research team. You can also check out what other Research projects are
We (the Analytics engineering team) are very interested in bot detection
as well. It might be useful to collaborate. We have several important use
cases for which we need to distinguish bot activity from human activity,
and we were planning on starting that work within one or two quarters.
On Thu, Mar 2, 2017 at 7:15 PM, Genevieve Bartlett <bartlett(a)isi.edu>
Hi All -
Emanuele Rocca suggested we reach out to you guys and see if you guys
would be willing to share web log/content access data.
Jelena and I are network security researchers at University of Southern
California's Information Sciences Institute. We're working on a project for
application-level DDoS defences, and are evaluating our defences for web
Our defences model how legitimate users interact with served content, and
using these models we attempt to differentiate between legitimate users and
any attacking bots during high load (i.e., a potential attack). Our models are
based on the timing between user requests and the semantic connections (or
lack thereof) between content requests. More information on our NSF-funded
project can be found here: https://steel.isi.edu/Projects/frade/
Right now, to collect data for evaluation, we've mirrored several sites
(Wikipedia is one of them :) and hired ~200 users to interact with our
mirrored sites (for app-level attack data, we simulate attacks). Of course,
this isn't the ideal way of getting data on human-content interaction
and we would be thrilled to augment our evaluation with "real world" data.
Wikipedia is particularly of interest to us given the number of "good"
bots which access content and whose access patterns may not exist in our
current models trained on human interactions with mirrored Wikipedia
We would be extremely grateful for any information you are willing to
share. We understand and fully support the need to preserve privacy of
Wikipedia and Wikipedia users, and we regularly work with anonymized
datasets. If there's any need for NDAs or similar agreements, we are
very open to whatever is necessary. In addition to web/access logs, we
would also be grateful for any information on application-level DoS attacks
or flash crowds you have experienced.
Analytics mailing list