On 11.01.2012 22:04, Bjoern Hoehrmann wrote:
* Frédéric Schütz wrote:
So, why would "404 error" and "File:Hardy_boys_cover_09.jpg" be ranked
so high? "404 error" is constant through the year (it may be a link
from a 404 page on a web server, but I'd still be surprised that it is
clicked so often), but the other is viewed in bursts (see e.g.
http://stats.grok.se/en/201012/File:Hardy_boys_cover_09.jpg). Any idea?
As I understand it, the access counts are derived from frontend cache
log files without much of an attempt to filter out abnormalities like
someone putting a stone on their keyboard to keep the F5 key pressed down,
which would cause their web browser to load the same page again and
again.
Yes, that's right.
More realistically, you have malfunctioning bots that end up in a
loop causing them to request the same page many times, and there
probably are deliberate attempts to push certain topics (even if you
keep it down and request only once per minute, that would still be
half a million per year, quite enough to get into "top" lists on the
smaller Wikipedia versions, for instance).
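As a quick check of the once-per-minute figure above, here is the plain arithmetic in Python (the only assumption is a non-leap year):

```python
# Requests made by a client that reloads one page once per minute for a year.
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: non-leap year

requests_per_year = MINUTES_PER_HOUR * HOURS_PER_DAY * DAYS_PER_YEAR
print(requests_per_year)  # 525600, i.e. roughly half a million
```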
I am still surprised by the 404 page, which is accessed a lot, but
steadily throughout the year. The image is strange as well, with bursts
here and there (not just one spike of accesses and nothing afterwards).
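One rough way to tell the two patterns apart, given only daily view counts, might be to compare how spread out the counts are. The sketch below is only an illustration: the function name burstiness and both series of numbers are made up, and the two measures (coefficient of variation, peak-to-median ratio) are my choice, not something the statistics site provides.

```python
from statistics import mean, median, pstdev


def burstiness(daily_counts):
    """Return (coefficient of variation, peak-to-median ratio) for a series."""
    avg = mean(daily_counts)
    if avg == 0:
        return 0.0, 0.0
    cv = pstdev(daily_counts) / avg
    med = median(daily_counts)
    # Guard against a zero median (page idle on most days).
    peak_ratio = max(daily_counts) / med if med > 0 else float("inf")
    return cv, peak_ratio


# Made-up daily view counts, for illustration only.
steady = [1000, 1020, 980, 1010, 990, 1005, 995]
bursty = [10, 12, 9000, 11, 8, 7500, 10]

print(burstiness(steady))
print(burstiness(bursty))
```

A steady page like the 404 one should give values close to zero and one respectively, while a page viewed in bursts should give much larger values for both.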
If you look further down in your list, you'll probably find that
Special:Export/* is extremely popular even though it's a very obscure
feature; apparently some articles are exported hundreds of thousands
of times.
Yes, there is plenty of that.
You would need access to additional data, like the IP addresses the
requests come from, or the Referer header in requests, and so on, to
attempt individual guesses (that data however cannot be published for
privacy reasons, I do not know if it is collected at all or in what
form).
As far as I can remember, IP addresses are geolocated, and only the
broad location is kept (and that is, or was, done only for one request
out of one thousand).
So the numbers are rather rough and won't really tell you anything you
did not already know (Sex > Astrobiology, no surprise there), and you
can't really say Steve Jobs > Justin Bieber based on this data without
explaining all the caveats anyway. The data is more useful if you look
for general trends like in http://katograph.appspot.com/ which tells
you things such as that articles on people in Film are viewed much more
often than those on people in Sports, which is at least slightly
non-obvious.
That raises another question: what is the easiest way to know which
articles belong to a given broad category?
Categories do not work well: try to take a top category such as
Category:Sports and go down the category tree; you'll probably get most
of the Wikipedia pages (at least, you'll definitely get many pages
that have nothing to do with sports). WikiProjects seem better, but
there are many of them (including sub-projects, etc.), so it is not easy
to automate.
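To illustrate the drift, here is a minimal sketch in Python on a made-up category graph. All category and article names are invented, and a real run would have to fetch subcategories from the wiki instead of reading two dictionaries. The depth limit is just one possible mitigation, not something I have tested on real data.

```python
from collections import deque

# Toy category graph, invented for illustration; not real Wikipedia data.
SUBCATEGORIES = {
    "Sports": ["Sportspeople", "Sports venues"],
    "Sportspeople": ["Sportspeople who became politicians"],
    "Sportspeople who became politicians": ["Politics"],
    "Sports venues": [],
    "Politics": ["Elections"],
    "Elections": [],
}
ARTICLES = {
    "Sports": ["Sport"],
    "Sportspeople": ["Athlete A"],
    "Sportspeople who became politicians": ["Athlete B"],
    "Sports venues": ["Stadium C"],
    "Politics": ["Parliament"],
    "Elections": ["Election D"],
}


def articles_under(root, max_depth=None):
    """Collect articles reachable from root, optionally limiting depth."""
    seen = {root}
    queue = deque([(root, 0)])
    found = set()
    while queue:
        category, depth = queue.popleft()
        found.update(ARTICLES.get(category, []))
        if max_depth is not None and depth >= max_depth:
            continue  # do not descend any further from this category
        for child in SUBCATEGORIES.get(category, []):
            if child not in seen:  # category graphs can contain cycles
                seen.add(child)
                queue.append((child, depth + 1))
    return found


print(sorted(articles_under("Sports")))
print(sorted(articles_under("Sports", max_depth=1)))
```

Without a limit, the walk from "Sports" ends up collecting "Parliament" and "Election D"; with max_depth=1 it stays within sports, at the price of missing "Athlete B".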
Frédéric