Anthony wrote:
On Thu, Apr 22, 2010 at 6:31 PM, Platonides <Platonides@gmail.com> wrote:
S. Nunes wrote:
> Hi all,
>
> I presume that Wikipedia keeps data about HTTP accesses to all articles.
> Can anybody inform me if this data is available for research purposes?

No. With the amount of traffic it has, space needs would be immense, and Wikimedia is not interested in logging all accesses.
What kind of space needs are we talking about?
100k requests per second. Assuming a URL is 50 bytes on average, that's 432 GB per day for the URLs alone (the usual Apache log line is about 1.5 times that). Most requests are handled by the Squids, so the backend servers are not even aware of them. Tim Starling had to write a patch for Squid in order to register the articles accessed (i.e. the data behind Domas' wikistats).
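For the curious, the arithmetic checks out; a quick sketch using the figures above (the request rate and byte counts are the post's assumptions, not measured Wikimedia numbers):

    # Back-of-the-envelope log volume, using the figures from the post.
    requests_per_second = 100_000
    bytes_per_url = 50            # assumed average URL length
    seconds_per_day = 86_400

    url_only = requests_per_second * bytes_per_url * seconds_per_day
    print(url_only / 10**9)       # 432.0 GB/day for the URLs alone

    apache_line = url_only * 1.5  # a full Apache log line is ~1.5x a bare URL
    print(apache_line / 10**9)    # ~648 GB/day for full log lines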
I find it hard to imagine that the other top 10 websites aren't keeping this information.
They probably store it in aggregate form and/or keep just a sample.
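Sampled logging can be as simple as writing one request in N to disk and scaling the counts back up; a minimal sketch of the idea (the 1-in-1000 rate and the plain-URL format here are made up for illustration, not what any of those sites actually use):

    import random

    SAMPLE_RATE = 1000  # keep roughly 1 request in 1000 (hypothetical rate)

    def maybe_log(logfile, url):
        # Unbiased sampling: each request independently has a 1/SAMPLE_RATE
        # chance of being written, so aggregate counts derived from the log
        # can be multiplied by SAMPLE_RATE to estimate real traffic.
        if random.randrange(SAMPLE_RATE) == 0:
            logfile.write(url + "\n")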
Shouldn't you be logging every access, at least for a few days, in case of some sort of security breach?
You would need to:
a) Detect that there is a security breach.
b) Find what produced the security breach in that log.
What if your referer was your Facebook personal page, leaking your full real name?
And what if you're in the sample? I find it quite inappropriate that even sampled data like this is being released.
The referer is not stored anywhere.