Anthony wrote:
On Thu, Apr 22, 2010 at 6:31 PM, Platonides <Platonides@gmail.com> wrote:
S. Nunes wrote:
Hi all,
I presume that Wikipedia keeps data about HTTP accesses to all articles.
Can anybody inform me if this data is available for research purposes?
No. With the amount of traffic it has, the space needs would be immense, and
Wikimedia is not interested in logging all accesses.
What kind of space needs are we talking about?
100k requests per second.
Assuming that a URL is 50 bytes on average, that's 432 GB per day (the
usual Apache log line is about 1.5 times that).
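As a rough back-of-the-envelope sketch of that figure (using only the 100k
requests per second and 50-byte average URL quoted above; the variable names
are mine and purely illustrative):

    requests_per_second = 100_000   # figure quoted above
    bytes_per_url = 50              # assumed average URL size
    seconds_per_day = 86_400
    gb_per_day = requests_per_second * bytes_per_url * seconds_per_day / 1e9
    print(gb_per_day)               # -> 432.0 GB of raw URLs per day

With full Apache-style log lines at roughly 1.5x that size, you would be well
past half a terabyte per day before any compression.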
Most requests are handled by the Squids, so the backend servers are not
even aware of them. Tim Starling had to write a patch for Squid in order
to record which articles are accessed (i.e., the data behind Domas' wikistats).
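To illustrate the kind of aggregated per-article counting being described
(this is not Tim Starling's actual Squid patch, just a toy sketch of the idea):

    from collections import Counter

    # Purely illustrative in-memory tally of page hits.
    hits = Counter()

    def count_hit(article_title):
        # Increment a per-article counter instead of writing a full log line.
        hits[article_title] += 1

Flushing such counters periodically and publishing the totals is roughly the
shape of the data that ends up in wikistats, with no per-request log retained.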
I find it hard to imagine that the other top 10 websites aren't keeping
this information.
They probably store it aggregated and/or just a sample.
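A minimal sketch of what sampled request logging could look like (purely
illustrative; the 1-in-1000 rate and the function name are assumptions, not a
description of any actual pipeline):

    import random

    SAMPLE_RATE = 1000  # assumption: keep roughly 1 request in 1000

    def maybe_log(log_file, request_line):
        # Write only a random sample of request lines; the rest are dropped.
        if random.randrange(SAMPLE_RATE) == 0:
            log_file.write(request_line + "\n")

Aggregation (per-article counters, as above) and sampling like this keep the
storage cost a tiny fraction of logging every access.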
Shouldn't you be logging every access, at least for a few days, in case
of some sort of security breach?
You would need to:
a) detect that there is a security breach, and
b) find what produced it in that log.
What if your referer was your Facebook personal page, leaking your full
real name?
And what if you're in the sample? I find it quite inappropriate that
even sampled data like this is being released.
The referer is not stored anywhere.