Hi All -

Emanuele Rocca suggested we reach out to you to see if you would be willing to share web log/content access data.

Jelena and I are network security researchers at the University of Southern California's Information Sciences Institute. We're working on a project on application-level DDoS defences, and are evaluating our defences for web applications.

Our defences model how legitimate users interact with served content, and using these models we attempt to differentiate between legitimate users and any attacking bots during high load (i.e., a potential attack). Our models are based on the timing between user requests and the semantic connections (or lack thereof) between content requests. More information on our NSF-funded project can be found here: https://steel.isi.edu/Projects/frade/

Right now, to collect data for evaluation, we've mirrored several sites (Wikipedia is one of them :) and hired ~200 users to interact with our mirrored sites (for app-level attack data, we simulate attacks). Of course, this isn't an ideal way of getting data on human-content interaction, and we would be thrilled to augment our evaluation with "real world" data.

Wikipedia is of particular interest to us given the number of "good" bots that access its content and whose access patterns may not exist in our current models, which are trained on human interactions with mirrored Wikipedia content.

We would be extremely grateful for any information you are willing to share. We understand and fully support the need to preserve the privacy of Wikipedia and Wikipedia users, and we regularly work with anonymized datasets. If there's any need for NDAs or similar agreements, we are very open to whatever is necessary. In addition to web/access logs, we would also be grateful for any information on application-level DoS attacks or flash crowds you have experienced.

cheers,

gen