Hi all,
I presume that Wikipedia keeps data about HTTP accesses to all articles. Can anybody tell me whether this data is available for research purposes?
I am particularly interested in the HTTP referrer information for each article. I suspect this information could be used to estimate topical relevance for each document. Access to this information poses no risk to users' privacy, since no user information would be made available: session IDs, hour/minute timestamps, and IPs could easily be discarded.
I am new to this list, so I don't know whether this has been discussed before. I searched the archives and found no relevant results on this issue.
Thanks in advance for your feedback, -- Sérgio Nunes
Hi Sérgio,
Some universities (like ours) receive a 1/100 sample of the whole set of requests processed by the Wikimedia Squid servers.
It is provided on direct request, however. As far as I know, the data is not consistently archived in a public repository anywhere (though I may be unaware of some system storing that info).
Some work has already been published on this topic:
* A. J. Reinoso, J. M. Gonzalez-Barahona, G. Robles, and F. Ortega, "A quantitative approach to the use of the wikipedia," in 2009 IEEE Symposium on Computers and Communications. IEEE, July 2009, pp. 56-61. [Online]. Available: http://dx.doi.org/10.1109/ISCC.2009.5202401
Regards, Felipe.
On 04/13/10 10:09, Felipe Ortega wrote:
Some universities (like ours) receive a 1/100 sample of the whole set of requests processed by the Wikimedia Squid servers.
At UMN, we have a 1/10 sample going back to spring 2007. We're happy to share, but it's pretty unwieldy and other resources are often sufficient.
We don't have the referrer data, though. IIRC it's just timestamp and requested URL.
Reid
Thanks for the quick feedback.
Can you tell me to whom this 'direct request' should be addressed? A 1/100 sample or similar would be great. Is referrer data included in this sample?
Regards, -- Sérgio Nunes
This is a reply to the whole discussion, and to at least one other recent thread, rather than to this last question.
This probably doesn't do everything that the people asking need, but it should be relevant. I'm a bit surprised it hasn't been mentioned - at least the people involved should be able to advise, even if the online data isn't usable.
Wikipedia article traffic statistics: http://stats.grok.se/
is a "mere visualizer" for the raw data available here: http://dammit.lt/wikistats/
as stated in the visualizer's FAQ: http://stats.grok.se/about
"Domas Mituzas (http://dammit.lt) put together a system to gather access statistics from wikipedia's squid cluster and publishes it here (http://dammit.lt/wikistats/). This site is a mere visualizer of that data."
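If it helps, here is a minimal sketch of how one might read those raw files. I'm assuming the hourly pagecounts files use the documented "project page_title count bytes" line format; the file name below is made up:

import gzip
from urllib.parse import unquote

# Hypothetical local copy of one hourly file from http://dammit.lt/wikistats/
# Each line is assumed to be: project page_title view_count bytes_transferred
path = "pagecounts-20100413-130000.gz"

top = []
with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        project, title, count, _bytes = parts
        if project == "en":  # English Wikipedia only
            top.append((int(count), unquote(title)))

top.sort(reverse=True)
for count, title in top[:10]:
    print(count, title)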
Very happy if this can be of any help! Martin
--- On Tue, 13 Apr 2010, Martin Hellberg Olsson Martin.HellbergOlsson@UGent.be wrote:
Wikipedia article traffic statistics: http://stats.grok.se/ is a "mere visualizer" for the raw data available here: http://dammit.lt/wikistats/
Sorry, correct me if I'm wrong, but I think Domas' dumps only contain info about articles (that is, pages in the main namespace) and a summary count of hits for each page visited (so you can say which article is the most visited).
We receive raw data for all namespaces of all Wikimedia projects, so you can get more info by parsing the URLs (like the different actions requested: view, preview, save...).
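As a rough illustration of the kind of URL parsing I mean (just a sketch, assuming the usual /wiki/Title and /w/index.php?title=...&action=... MediaWiki URL patterns; the sample URLs are made up):

from urllib.parse import urlparse, parse_qs, unquote

def classify(url):
    # Split a MediaWiki request URL into (page title, requested action).
    parsed = urlparse(url)
    if parsed.path.startswith("/wiki/"):
        # Pretty article URL, e.g. /wiki/Main_Page: a plain page view.
        return unquote(parsed.path[len("/wiki/"):]), "view"
    if parsed.path.endswith("/index.php"):
        qs = parse_qs(parsed.query)
        title = qs.get("title", [""])[0]        # parse_qs already percent-decodes
        action = qs.get("action", ["view"])[0]  # MediaWiki's default action is view
        return title, action
    return None, None

# Made-up examples of the kinds of URLs in the raw feed:
for url in (
    "http://en.wikipedia.org/wiki/Squid_(software)",
    "http://en.wikipedia.org/w/index.php?title=Squid_(software)&action=edit",
    "http://en.wikipedia.org/w/index.php?title=Squid_(software)&action=history",
):
    print(classify(url))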
Best, Felipe.
Felipe: You're mostly right, and I wouldn't expect to know more about this than you, but it's not only the main namespace. You can do things like:
http://stats.grok.se/en/201003/Wikipedia%3AAbout
So all wiki pages, but probably not page histories and such things.
Also, I believed it was calculated from some kind of "full" data, while most things discussed here were based on samples like 1/100. Looking at the about page again, though, I'm not sure that was right.
Still, if Domas doesn't read this list (I have no idea), could it hurt to contact him?
Kind regards, Martin
--- On Tue, 13 Apr 2010, Martin Hellberg Olsson martin.hellbergolsson@ugent.be wrote:
Felipe: You're mostly right, and I wouldn't expect to know more about this than you, but it's not only the main namespace. You can do things like: http://stats.grok.se/en/201003/Wikipedia%3AAbout
On the contrary, thanks for pointing this out. I was just guessing from what I've seen in the content of those files, and I only had a quick look, so most probably you're right about that :-).
Still, if Domas doesn't read this list (I have no idea), could it hurt to contact him?
Definitely, it'd be great if the data were already there. Many people are interested in information about traffic, and the more analyses we get, the more we'll eventually know about Wikipedia dynamics.
Domas is always quite busy, but it'd be great to know more about this.
Best, Felipe.
On Tue, Apr 13, 2010 at 7:14 PM, Martin Hellberg Olsson Martin.HellbergOlsson@ugent.be wrote:
Wikipedia article traffic statistics: http://stats.grok.se/ is a "mere visualizer" for the raw data available here: http://dammit.lt/wikistats/
At http://en.wikipedia.org/wiki/Wikipedia:Statistics you can find these links and many more. Hope it helps!
S. Nunes wrote:
Hi all,
I presume that Wikipedia keeps data about HTTP accesses to all articles. Can anybody inform me if this data is available for research purposes?
No. With the amount of traffic it has, space needs would be immense, and Wikimedia is not interested in logging all accesses.
You can use Domas' wikistats if they contain enough data for you. You may get a sampled feed for processing after contacting the Foundation.
I am particularly interested in HTTP referral information for each article. I suspect that this information could be used to estimate topical relevance for each document.
Just from Wikimedia referers, or from the whole web? How does knowing the page from which readers reached Wikipedia help to estimate document relevance?
Access to this information poses no risk to users' privacy, since no user information would be made available: session IDs, hour/minute timestamps, and IPs could easily be discarded.
What if your referer was your Facebook personal page, leaking your full real name? It may be possible to properly anonymize referers, but it's not trivial either.
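To illustrate why it is not trivial: a naive anonymizer might keep only the referring site and drop the path, query string, and fragment. A sketch under that assumption (note that even a hostname can identify a person on a single-user domain):

from urllib.parse import urlparse

def scrub_referer(referer):
    # Keep only scheme and hostname; drop path, query, and fragment,
    # which are the parts most likely to carry personal data
    # (usernames, profile IDs, search terms).
    parsed = urlparse(referer)
    if not parsed.scheme or not parsed.netloc:
        return ""  # malformed or empty referer: drop it entirely
    return f"{parsed.scheme}://{parsed.hostname}/"

# Made-up example: the profile path leaks a real name, the host does not.
print(scrub_referer("http://www.facebook.com/jane.doe.1234?ref=profile"))
# -> http://www.facebook.com/

Of course, this also throws away exactly the search-engine query terms you are after, which is the tension here.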
On Thu, Apr 22, 2010 at 6:31 PM, Platonides Platonides@gmail.com wrote:
S. Nunes wrote:
Hi all,
I presume that Wikipedia keeps data about HTTP accesses to all articles. Can anybody inform me if this data is available for research purposes?
No. With the amount of traffic it has, space needs would be immense, and Wikimedia is not interested in logging all accesses.
What kind of space needs are we talking about? I find it hard to imagine that the other top 10 websites aren't keeping this information. Shouldn't you be logging every access, at least for a few days, in case of some sort of security breach?
Access to this information poses no risk to users' privacy, since no user information would be made available: session IDs, hour/minute timestamps, and IPs could easily be discarded.
What if your referer was your Facebook personal page, leaking your full real name?
And what if you're in the sample? I find it quite inappropriate that even sampled data like this is being released.
Anthony wrote:
What if your referer was your Facebook personal page, leaking your full real name?
And what if you're in the sample? I find it quite inappropriate that even sampled data like this is being released.
It's not. The sample data we get is sequence number, timestamp, and requested URL. That's it.
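If it helps, parsing that feed is nearly a one-liner. A sketch only: the whitespace separator and the example line are my shorthand, not the exact format:

# Assumes whitespace-separated fields: sequence number, timestamp, requested URL.
def parse_sample_line(line):
    seq, timestamp, url = line.split(None, 2)
    return int(seq), timestamp, url.strip()

# Made-up example line, not real data:
print(parse_sample_line("1024 2010-04-13T13:23:00 http://en.wikipedia.org/wiki/Main_Page"))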
Reid
Anthony wrote:
What kind of space needs are we talking about?
100k requests per second. Assuming that a URL is 50 bytes on average, that's 432 GB per day (the usual Apache log line is about 1.5 times that). Most requests are handled by the Squids, so the backend servers are not even aware of them. Tim Starling had to write a patch for Squid in order to register the articles accessed (i.e. the data in Domas' wikistats).
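The arithmetic, spelled out under those same assumptions:

requests_per_second = 100_000
bytes_per_url = 50        # assumed average URL length
seconds_per_day = 86_400

bytes_per_day = requests_per_second * bytes_per_url * seconds_per_day
print(bytes_per_day / 10**9)        # 432.0 GB per day for bare URLs
print(bytes_per_day * 1.5 / 10**9)  # ~648 GB per day at Apache-log-line size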
I find it hard to imagine that the other top 10 websites aren't keeping this information.
They probably store it aggregated and/or just a sample.
Shouldn't you be logging every access, at least for a few days, in case of some sort of security breach?
You would need to a) detect that there is a security breach, and b) find what produced it in that log.
What if your referer was your Facebook personal page, leaking your full real name?
And what if you're in the sample? I find it quite inappropriate that even sampled data like this is being released.
The referer is not stored anywhere.
On Mon, Apr 26, 2010 at 5:52 PM, Platonides Platonides@gmail.com wrote:
Anthony wrote:
What kind of space needs are we talking about?
100k requests per second. Assuming that a URL is 50 bytes on average, that's 432 GB per day (the usual Apache log line is about 1.5 times that).
Seems reasonable. For 3 days of access, that's about 18 GB per server across 70 servers.
And that's without compression, and 50 bytes seems awfully long for a URL.
What if your referer was your Facebook personal page, leaking your full real name?
And what if you're in the sample? I find it quite inappropriate that even sampled data like this is being released.
The referer is not stored anywhere.
Well, that's good to hear. What exactly is contained in the sampled data that is being released? We've heard what's in the 1/10 sample Mr. Priedhorsky is getting, but what about the rest?
Hi all,
On 22 April 2010 23:31, Platonides Platonides@gmail.com wrote:
No. With the amount of traffic it has, space needs would be immense, and Wikimedia is not interested in logging all accesses.
I understand, but I think they might be discarding relevant meta-information that would enrich Wikipedia. Also, a short sample (e.g. the last week) would suffice for many exploratory studies.
You may get a sampled feed for processing after contacting the foundation.
Can you tell me the best way to "contact the foundation"?
How does knowing the page from which they reached wikipedia help to estimate the document relevance?
I'm interested in information from the whole web. Regarding document relevance: anchor text is a very valuable signal in web information retrieval, and with referrer information it would be possible to extract it and link it to the corresponding article. Also, by looking at referrer data from Google, Bing or Yahoo, we could identify the query terms used to reach each article.
I think that there are some possibilities worth exploring here.
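To make the query-term idea concrete, here is a minimal sketch. The parameter names (q for Google and Bing, p for Yahoo) are the classic ones and an assumption on my part, as is the example referrer:

from urllib.parse import urlparse, parse_qs

# Classic query parameter per engine (an assumption; engines change these).
QUERY_PARAMS = {"google": "q", "bing": "q", "yahoo": "p"}

def extract_query_terms(referer):
    # Return (engine, search terms) if the referer is a recognized search engine.
    parsed = urlparse(referer)
    host = (parsed.hostname or "").lower()
    for engine, param in QUERY_PARAMS.items():
        if engine in host:
            values = parse_qs(parsed.query).get(param)
            if values:
                return engine, values[0]
    return None

# Made-up example referer, not real log data:
print(extract_query_terms("http://www.google.com/search?q=squid+cache+wikipedia"))
# -> ('google', 'squid cache wikipedia')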
What if your referer was your Facebook personal page, leaking your full real name?
This is a valid concern, but is it a realistic scenario? If so, it seems this could be seen as a Facebook security breach - if a profile is private, its information should not be passed to others.
Thanks again for all feedback, -- Sérgio Nunes