Dear Wikitechnicians,
My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens Research, which is the human-computer interaction group at the University of Minnesota.
We are currently working on research investigating Wikipedia contribution and vandalism. To this end, statistics on the view rate of different articles would be extremely helpful to us -- something along the lines of Leon Weber's WikiCharts tool, but with a larger limit (ideally all 1.7 million articles).
It seems to me that the easiest way to accomplish this would be to get copies of your sampled Squid logs (as described on http://lists.wikimedia.org/pipermail/wikitech-l/2007-January/029000.html and its links). We do not need the client IP or any other similarly sensitive data, though if you gave it to us we would protect it carefully as we protect the other sensitive research data we handle.
Would it be possible for us to have access to these log files?
If not, I would love to begin a discussion on what it would be possible for us to access.
Your help would be greatly appreciated. Please let me know if you have any questions.
Thanks,
Reid
Greetings, describe for me what your ideal data would look like.
On 3/28/07, Reid Priedhorsky reid@umn.edu wrote:
Dear Wikitechnicians,
My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens Research, which is the human-computer interaction group at the University of Minnesota.
We are currently working on research investigating Wikipedia contribution and vandalism. To this end, statistics on the view rate of different articles would be extremely helpful to us -- something along the lines of Leon Weber's WikiCharts tool, but with a larger limit (ideally all 1.7 million articles).
It seems to me that the easiest way to accomplish this would be to get copies of your sampled Squid logs (as described on http://lists.wikimedia.org/pipermail/wikitech-l/2007-January/029000.html and its links). We do not need the client IP or any other similarly sensitive data, though if you gave it to us we would protect it carefully as we protect the other sensitive research data we handle.
Would it be possible for us to have access to these log files?
If not, I would love to begin a discussion on what it would be possible for us to access.
Your help would be greatly appreciated. Please let me know if you have any questions.
Thanks,
Reid
Reid Priedhorsky wrote:
Dear Wikitechnicians,
My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens Research, which is the human-computer interaction group at the University of Minnesota.
We are currently working on research investigating Wikipedia contribution and vandalism. To this end, statistics on the view rate of different articles would be extremely helpful to us -- something along the lines of Leon Weber's WikiCharts tool, but with a larger limit (ideally all 1.7 million articles).
Producing such statistics will be a Google Summer of Code project this summer. If you can't wait that long, then we can give you a sampled, anonymised log stream to analyse.
-- Tim Starling
Tim Starling wrote:
Reid Priedhorsky wrote:
Dear Wikitechnicians,
My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens Research, which is the human-computer interaction group at the University of Minnesota.
We are currently working on research investigating Wikipedia contribution and vandalism. To this end, statistics on the view rate of different articles would be extremely helpful to us -- something along the lines of Leon Weber's WikiCharts tool, but with a larger limit (ideally all 1.7 million articles).
Producing such statistics will be a Google Summer of Code project this summer. If you can't wait that long, then we can give you a sampled, anonymised log stream to analyse.
Yes, summer would be too late: anonymised logs would be excellent for our purposes. Does "stream" mean that we would need to write a program to listen to the real-time log stream, or could you give us files?
Gregory Maxwell wrote:
Greetings, describe for me what your ideal data would look like.
Ideal data would be log files that just looked like:
Main Page\t1169499304.066
i.e., article titles as they appear in the XML dumps, plus the request time.
A close second choice would be simply-anonymized logs, e.g.:
sq18.wikimedia.org 1715898 1169499304.066 0 - TCP_MEM_HIT/200 13208 GET http://en.wikipedia.org/wiki/Main_Page NONE/- text/html - - -
If the logs still contain duplicates due to requests being forwarded between squids, we'd need pointers on how to resolve those.
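For concreteness, here is a rough sketch (Python) of how we would read either format on our end; the field positions for the squid-style line are only my guesses from the sample above:

    def parse_ideal(line):
        # Ideal format: article title, a tab, then the request timestamp.
        title, ts = line.rstrip("\n").split("\t")
        return title, float(ts)

    def parse_squid(line):
        # Squid-style sample: the timestamp looks like the 3rd field, the request URL the 9th.
        fields = line.split()
        return fields[8], float(fields[2])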
Please let me know what the next step is. Thanks for your help!
Reid
On 29/03/07, Reid Priedhorsky reid@umn.edu wrote:
Tim Starling wrote:
Reid Priedhorsky wrote:
Dear Wikitechnicians,
My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens Research, which is the human-computer interaction group at the University of Minnesota.
We are currently working on research investigating Wikipedia contribution and vandalism. To this end, statistics on the view rate of different articles would be extremely helpful to us -- something along the lines of Leon Weber's WikiCharts tool, but with a larger limit (ideally all 1.7 million articles).
Producing such statistics will be a Google Summer of Code project this summer. If you can't wait that long, then we can give you a sampled, anonymised log stream to analyse.
Yes, summer would be too late: anonymised logs would be excellent for our purposes. Does "stream" mean that we would need to write a program to listen to the real-time log stream, or could you give us files?
Gregory Maxwell wrote:
Greetings, describe for me what your ideal data would look like.
Ideal data would be log files that just looked like:
Main Page\t1169499304.066
i.e., article titles as they appear in the XML dumps, plus the request time.
A close second choice would be simply-anonymized logs, e.g.:
sq18.wikimedia.org 1715898 1169499304.066 0 - TCP_MEM_HIT/200 13208 GET http://en.wikipedia.org/wiki/Main_Page NONE/- text/html - - -
If the logs still contain duplicates due to requests being forwarded between squids, we'd need pointers on how to resolve those.
Please let me know what the next step is. Thanks for your help!
Reid
Just a small aside: please keep us up-to-date on the outcome of the research over on the Wiki-research-l mailing list. It's always interesting (and potentially useful) to see how Wikipedia is used.
On 3/30/07, Oldak Quill oldakquill@gmail.com wrote:
Just a small aside: please keep us up-to-date on the outcome of the research over on the Wiki-research-l mailing list. It's always interesting (and potentially useful) to see how Wikipedia is used.
Yes, I second this. We seem to get quite a few posts of the type "We have been intensively researching some feature of Wikipedia for the last 2 years, and we just need one detail to continue our research". And that's the last we hear of them. I'd love to hear the results of it - it would benefit our project a lot to have some hard statistics.[1]
Steve
[1] Is that an oxymoron?
Steve Bennett wrote:
On 3/30/07, Oldak Quill oldakquill@gmail.com wrote:
Just a small aside: please keep us up-to-date on the outcome of the research over on the Wiki-research-l mailing list. It's always interesting (and potentially useful) to see how Wikipedia is used.
Yes, I second this. We seem to get quite a few posts of the type "We have been intensively researching some feature of Wikipedia for the last 2 years, and we just need one detail to continue our research". And that's the last we hear of them. I'd love to hear the results of it - it would benefit our project a lot to have some hard statistics.[1]
Certainly. Our goal is to publish in a standard HCI venue, and those publications are public info. I've put it on my to-do list to send a note to wiki-research-l when our results are available.
Take care,
Reid
p.s. Thanks for the pointer to that list -- I wasn't aware of it, and its content looks quite interesting.
Reid Priedhorsky wrote:
Tim Starling wrote:
Reid Priedhorsky wrote:
Dear Wikitechnicians,
My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens Research, which is the human-computer interaction group at the University of Minnesota.
We are currently working on research investigating Wikipedia contribution and vandalism. To this end, statistics on the view rate of different articles would be extremely helpful to us -- something along the lines of Leon Weber's WikiCharts tool, but with a larger limit (ideally all 1.7 million articles).
Producing such statistics will be a Google Summer of Code project this summer. If you can't wait that long, then we can give you a sampled, anonymised log stream to analyse.
Yes, summer would be too late: anonymised logs would be excellent for our purposes. Does "stream" mean that we would need to write a program to listen to the real-time log stream, or could you give us files?
Gregory Maxwell wrote:
Greetings, describe for me what your ideal data would look like.
Ideal data would be log files that just looked like:
Main Page\t1169499304.066
i.e., article titles as they appear in the XML dumps, plus the request time.
A close second choice would be simply-anonymized logs, e.g.:
sq18.wikimedia.org 1715898 1169499304.066 0 - TCP_MEM_HIT/200 13208 GET http://en.wikipedia.org/wiki/Main_Page NONE/- text/html - - -
If the logs still contain duplicates due to requests being forwarded between squids, we'd need pointers on how to resolve those.
Please let me know what the next step is. Thanks for your help!
Reid
We received a very similar request from Vrije Universiteit, and we're now sending them a 1/10 sampled stream consisting of timestamp and URL, with duplicates removed, real-time via UDP. It would be easier for us if we could send you roughly the same thing. So for example:
1169499304.066 http://en.wikipedia.org/wiki/Main_Page
We don't have any system yet for periodically rotating, analysing and sending logs, so streams are certainly easier for us. We get somewhere on the order of 1.5 billion requests per day, and the simplified log line above has an average length of 97 bytes, so it's an unsampled data rate of about 135 GB per day. You'll probably want us to sample that down before we send it to you.
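On your end, receiving that stream should only take a few lines. As a rough sketch (Python; the bind address and port here are placeholders to be agreed on, and I'm assuming one "timestamp URL" line per datagram):

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 5500))  # placeholder address and port
    while True:
        data, _ = sock.recvfrom(65535)
        for line in data.decode("utf-8", "replace").splitlines():
            ts, url = line.split(" ", 1)
            print(ts, url)  # timestamp and full request URL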
Extracting the title as it appears in the XML dump is just a matter of finding the right part of the URL and then unescaping it.
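Roughly like this, as a sketch (it only handles /wiki/ article URLs and ignores things like /w/index.php requests):

    from urllib.parse import unquote

    def url_to_title(url):
        # e.g. http://en.wikipedia.org/wiki/Main_Page -> "Main Page"
        path = url.split("/wiki/", 1)[1]
        return unquote(path).replace("_", " ")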
You can contact me privately to get the technical details sorted out.
-- Tim Starling
On 4/1/07, Tim Starling tstarling@wikimedia.org wrote:
sending logs, so streams are certainly easier for us. We get somewhere on the order of 1.5 billion requests per day, and the simplified log line
Can I be the first to say "holy crap!"
Steve
On 04/04/07, Steve Bennett stevagewp@gmail.com wrote:
Can I be the first to say "holy crap!"
You're really that shocked?
Rob Church
On 4/4/07, Rob Church robchur@gmail.com wrote:
On 04/04/07, Steve Bennett stevagewp@gmail.com wrote:
Can I be the first to say "holy crap!"
You're really that shocked?
Had I sat down to think about it, perhaps, perhaps not. But I've never heard of a daily pageview figure expressed in *billions* before.
I had a webpage once. It got 400 pageviews in a year.
Steve
2007/4/4, Steve Bennett stevagewp@gmail.com:
On 4/4/07, Rob Church robchur@gmail.com wrote:
On 04/04/07, Steve Bennett stevagewp@gmail.com wrote:
Can I be the first to say "holy crap!"
You're really that shocked?
Had I sat down to think about it, perhaps, perhaps not. But I've never heard of a daily pageview figure expressed in *billions* before.
It's not a count of pageviews, but of http requests, is it?
AJF/WarX
On 04/04/07, Artur Fijałkowski wiki.warx@gmail.com wrote:
It's not a count of pageviews, but of http requests, is it?
That's right.
Rob Church
On 4/4/07, Rob Church robchur@gmail.com wrote:
On 04/04/07, Artur Fijałkowski wiki.warx@gmail.com wrote:
It's not a count of pageviews, but of http requests, is it?
That's right.
Ah, ok. So a page with a hundred images is like 105 http requests including CSS etc, but only one page view.
Steve
On Thu, 2007-05-04 at 11:09 +1000, Steve Bennett wrote:
On 4/4/07, Rob Church robchur@gmail.com wrote:
On 04/04/07, Artur Fijałkowski wiki.warx@gmail.com wrote:
It's not a count of pageviews, but of http requests, is it?
That's right.
Ah, ok. So a page with a hundred images is like 105 http requests including CSS etc, but only one page view.
It gets kind of complicated, since CSS and JS files, as well as skin images, are usually pretty well cached. We do about a 4-to-1 ratio of hits-to-pages on Wikitravel; I'd be surprised if that varied by more than 2x in either direction for Wikipedia.
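As a very rough back-of-the-envelope figure under that assumption, the 1.5 billion requests per day mentioned earlier would put Wikipedia somewhere around 375 million page views per day, give or take that factor of two.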
-Evan
Evan Prodromou evan@prodromou.name http://evan.prodromou.name/
Evan Prodromou wrote:
It gets kind of complicated, since CSS and JS files, as well as skin images, are usually pretty well cached. We do about a 4-to-1 ratio of hits-to-pages on Wikitravel; I'd be surprised if that varied by more than 2x in either direction for Wikipedia.
-Evan
Which of these statistics has relevance, and at what level of granularity? The USA Today readers are looking for something as simple as "# of daily page views", which any surfer can appreciate. The http request tally makes sense to developers who are concerned about the load on our servers, tweaks in performance, etc. Marketers want uniques per day or month, etc.
Which of these stats should be developed to give accurate information to the world about what performance is being achieved, in an apples-to-apples comparison to existing suites? What will give WMF the most credibility in reporting in the future?
(and a HUGE thank you to Tim for making this happen)
On Thu, 2007-05-04 at 10:03 -0400, Brad Patrick wrote:
Which of these statistics has relevance, and at what level of granularity? The USA Today readers are looking for something as simple as "# of daily page views", which any surfer can appreciate. The http request tally makes sense to developers who are concerned about the load on our servers, tweaks in performance, etc. Marketers want uniques per day or month, etc.
That's about it: page views per day, hits per day, and unique visitors per month are the three main stats people care about.
Which of these stats should be developed to give accurate information to the world about what performance is being achieved, in an apples-to-apples comparison to existing suites? What will give WMF the most credibility in reporting in the future?
Page views per day, I think.
-Evan
Evan Prodromou evan@prodromou.name http://evan.prodromou.name/
On 4/6/07, Evan Prodromou evan@prodromou.name wrote:
Which of these stats should be developed to give accurate information to the world about what performance is being achieved, in an apples-to-apples comparison to existing suites? What will give WMF the most credibility in reporting in the future?
Page views per day, I think.
Better be a bit more granular than that. Article views, edit-related views, RC views, front-page views, etc. As Yahoo! found out recently, "page views per day" tends to drop rather noticeably when you Ajax away some of the unnecessary ones. I would say it's not a very useful statistic.
On Wed, Apr 04, 2007 at 01:30:57AM +0100, Rob Church wrote:
On 04/04/07, Steve Bennett stevagewp@gmail.com wrote:
Can I be the first to say "holy crap!"
You're really that shocked?
A couple of years ago, when I was temporarily in the running for Local Hands (I live about 20 miles west of the datacenter), we were *just starting* to bump our heads on a 100Mb/s port.
So *I* was a bit shocked. :-)
Cheers, -- jra