Hi folks,
As I mentioned a week or two ago, I am trying to model historical article view rates of English Wikipedia. One of the sources of data for this is:
http://stats.wikimedia.org/EN/TablesUsagePageRequest.htm
I was wondering what the specific definition of "page request" is in this context, in a way that I can replicate the count on current data using the sampled access logs that I have.
The "definition" link on the above says that a page is a URL "... that would be considered the actual page being requested, and not all of the individual items that make it up (such as graphics and audio clips). Some people call this metric page views or page impressions, and defaults to any URL that has an extension of .htm, .html or .cgi." This seems incongruent with the URLs used at Wikipedia.
Any help would be greatly appreciated. Please let me know if you have any questions.
Reid
Reid Priedhorsky wrote:
> Hi folks,
> As I mentioned a week or two ago, I am trying to model historical article view rates of English Wikipedia. One of the sources of data for this is:
> http://stats.wikimedia.org/EN/TablesUsagePageRequest.htm
> I was wondering what the specific definition of "page request" is in this context, in a way that I can replicate the count on current data using the sampled access logs that I have.
> The "definition" link on the above says that a page is a URL "... that would be considered the actual page being requested, and not all of the individual items that make it up (such as graphics and audio clips). Some people call this metric page views or page impressions, and defaults to any URL that has an extension of .htm, .html or .cgi." This seems incongruent with the URLs used at Wikipedia.
> Any help would be greatly appreciated. Please let me know if you have any questions.
> Reid
Current Wikipedia pages are served under /wiki/*, which is probably what is meant. Other requests fall under different prefixes:
  graphics and audio clips: upload.wikimedia.org/*
  some CSS: /w/skins*
  special actions such as editing a page: /w/index.php*
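A minimal sketch of the path-based split described above, assuming each log record carries a full request URL. The function name and the category labels ("page", "media", "skin", "action", "other") are hypothetical and chosen only for illustration; they are not taken from any Wikimedia tool.

```python
from urllib.parse import urlsplit


def classify_request(url: str) -> str:
    """Classify a request URL by host and path prefix (illustrative labels)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path
    if host == "upload.wikimedia.org":
        return "media"   # graphics and audio clips
    if path.startswith("/wiki/"):
        return "page"    # current pages
    if path.startswith("/w/skins"):
        return "skin"    # CSS and other skin resources
    if path.startswith("/w/index.php"):
        return "action"  # special actions such as editing a page
    return "other"
```

Checking the host before the path keeps media requests from being counted as pages if an upload path ever happens to begin with /wiki/.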
Reid Priedhorsky wrote:
> Hi folks,
> As I mentioned a week or two ago, I am trying to model historical article view rates of English Wikipedia. One of the sources of data for this is:
> http://stats.wikimedia.org/EN/TablesUsagePageRequest.htm
> I was wondering what the specific definition of "page request" is in this context, in a way that I can replicate the count on current data using the sampled access logs that I have.
> The "definition" link on the above says that a page is a URL "... that would be considered the actual page being requested, and not all of the individual items that make it up (such as graphics and audio clips). Some people call this metric page views or page impressions, and defaults to any URL that has an extension of .htm, .html or .cgi." This seems incongruent with the URLs used at Wikipedia.
> Any help would be greatly appreciated. Please let me know if you have any questions.
I believe that figure came from Webalizer. We did attempt to make it roughly a page view count at the time, and that's probably as much as we can say about it now. It's a bit late for a detailed error analysis.
We should probably also tell you about the reqstats data, if you haven't found it already:
http://hemlock.knams.wikimedia.org/~leon/stats/reqstats/reqstats-yearly.png
This is the raw request rate, including images. Mark Bergsma may be able to supply you with the source data. I don't know if we have any equivalent data from before April 2006.
-- Tim Starling
Tim Starling wrote:
> Reid Priedhorsky wrote:
>> As I mentioned a week or two ago, I am trying to model historical article view rates of English Wikipedia. One of the sources of data for this is:
>> http://stats.wikimedia.org/EN/TablesUsagePageRequest.htm
>> I was wondering what the specific definition of "page request" is in this context, in a way that I can replicate the count on current data using the sampled access logs that I have.
> I believe that figure came from Webalizer. We did attempt to make it roughly a page view count at the time, and that's probably as much as we can say about it now. It's a bit late for a detailed error analysis.
OK. So that means that nobody's got the regular expressions around any more? Or have the URLs changed so that the old regexes are no longer relevant?
I tried counting URLs of the two forms:
http://en.wikipedia.org/wiki/Pierre_Omidyar
http://en.wikipedia.org/w/index.php?title=Pierre_Omidyar
This yields 102 million article views per day on average over the past month or so.
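The count over the two URL forms above could be sketched roughly as follows. This is an illustration, not the script actually used: the record layout (day label, URL), the function names, and the sampling_factor parameter (how many real requests each sampled log line stands for) are all assumptions, and the index.php pattern only accepts URLs whose sole query parameter is title.

```python
import re
from collections import Counter

# The two URL forms counted as article views (patterns are assumptions).
WIKI_FORM = re.compile(r"^http://en\.wikipedia\.org/wiki/([^?#]+)$")
INDEX_FORM = re.compile(r"^http://en\.wikipedia\.org/w/index\.php\?title=([^&#]+)$")


def article_title(url):
    """Return the title if the URL has one of the two forms, else None."""
    for pattern in (WIKI_FORM, INDEX_FORM):
        match = pattern.match(url)
        if match:
            return match.group(1)
    return None


def daily_article_views(records, sampling_factor):
    """Estimate article views per day from sampled (day, url) records.

    Each sampled line is assumed to represent `sampling_factor` requests.
    Namespaces are not filtered, so /wiki/Special:... also counts.
    """
    sampled = Counter()
    for day, url in records:
        if article_title(url) is not None:
            sampled[day] += 1
    return {day: count * sampling_factor for day, count in sampled.items()}


def mean_daily_views(records, sampling_factor):
    """Average the per-day estimates over all days that have any views."""
    per_day = daily_article_views(records, sampling_factor)
    return sum(per_day.values()) / len(per_day) if per_day else 0.0
```

Requests such as index.php?title=...&action=edit are deliberately excluded here; whether they should count as views is a judgment call that changes the total.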
When I graph log(Webalizer data) and (log(Alexa data) - c), where c is a constant adjusted by hand so that the two lines roughly coincide, the 102 Mviews/day figure also matches the Alexa data quite well.
This is encouraging and suggests that I'm counting correctly. Does that seem reasonable to you all as well?
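As an alternative to adjusting c by hand, the offset can be estimated by least squares: minimizing the sum of (log(Alexa) - c - log(Webalizer))^2 over the dates where both series exist gives c as the mean of log(Alexa) - log(Webalizer). The sketch below swaps in that technique; it is not the procedure described above, and the function names and input layout (two equal-length lists of positive values for the same dates) are assumptions.

```python
import math


def fit_log_offset(webalizer, alexa):
    """Least-squares c so that log(alexa) - c is close to log(webalizer)."""
    if len(webalizer) != len(alexa) or not webalizer:
        raise ValueError("need two non-empty series of equal length")
    diffs = [math.log(a) - math.log(w) for w, a in zip(webalizer, alexa)]
    return sum(diffs) / len(diffs)


def alexa_to_views(alexa_value, c):
    """Map an Alexa value onto the page-view scale using the fitted offset."""
    return math.exp(math.log(alexa_value) - c)
```

A fixed offset in log space is the same as a constant multiplicative factor between the two series, so the fit is only meaningful if the ratio stays roughly stable over the overlapping period.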
> We should probably also tell you about the reqstats data, if you haven't found it already:
> http://hemlock.knams.wikimedia.org/~leon/stats/reqstats/reqstats-yearly.png
> This is the raw request rate, including images. Mark Bergsma may be able to supply you with the source data. I don't know if we have any equivalent data from before April 2006.
Yeah, I saw that. I'm not sure exactly what to make of it, since I'm interested in article view rate, and that seemed to be all requests.
I would like to add that I really appreciate the ongoing help you all have given. It's made our task here at UMN considerably easier and more effective, and it means a lot to me.
Take care,
Reid
wikitech-l@lists.wikimedia.org