Anthony wrote:
On Thu, Apr 22, 2010 at 6:31 PM, Platonides <Platonides@gmail.com> wrote:
S. Nunes wrote:
> Hi all,
>
> I presume that Wikipedia keeps data about HTTP accesses to all articles.
> Can anybody inform me if this data is available for research purposes?

No. With the amount of traffic it has, space needs would be immense, and Wikimedia is not interested in logging all accesses.
What kind of space needs are we talking about?
100k requests per second. Assuming a URL is 50 bytes on average, that's 432 GB per day for the URLs alone (the usual Apache log line is about 1.5 times that). Most requests are handled by the Squids, so the backend servers are not even aware of them. Tim Starling had to write a patch for Squid in order to register the articles accessed (i.e. the data behind Domas' wikistats).
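For the curious, the arithmetic checks out; a quick sketch using the figures above (the request rate and byte counts are the post's assumptions, not measured Wikimedia numbers):

    # Back-of-the-envelope log volume, using the figures from the post.
    requests_per_second = 100_000
    bytes_per_url = 50            # assumed average URL length
    seconds_per_day = 86_400

    url_only = requests_per_second * bytes_per_url * seconds_per_day
    print(url_only / 10**9)       # 432.0 GB/day for the URLs alone

    apache_line = url_only * 1.5  # a full Apache log line is ~1.5x a bare URL
    print(apache_line / 10**9)    # ~648 GB/day for full log lines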
I find it hard to imagine that the other top 10 websites aren't keeping this information.
They probably store it in aggregate form and/or keep just a sample.
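Sampled logging can be as simple as writing one request in N to disk and scaling the counts back up; a minimal sketch of the idea (the 1-in-1000 rate and the plain-URL format here are made up for illustration, not what any of those sites actually use):

    import random

    SAMPLE_RATE = 1000  # keep roughly 1 request in 1000 (hypothetical rate)

    def maybe_log(logfile, url):
        # Unbiased sampling: each request independently has a 1/SAMPLE_RATE
        # chance of being written, so aggregate counts derived from the log
        # can be multiplied by SAMPLE_RATE to estimate real traffic.
        if random.randrange(SAMPLE_RATE) == 0:
            logfile.write(url + "\n")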
Shouldn't you be logging every access, at least for a few days, in case of some sort of security breach?
You would need to:
a) Detect that there is a security breach.
b) Find what produced the security breach in that log.
What if your referer was your Facebook personal page, leaking your full real name?
And what if you're in the sample? I find it quite inappropriate that even sampled data like this is being released.
The referer is not stored anywhere.