[Foundation-l] Questions about most viewed articles in 2011
derhoermi at gmx.net
Wed Jan 11 21:04:44 UTC 2012
* Frédéric Schütz wrote:
>So, why would "404 error" and "File:Hardy_boys_cover_09.jpg" be ranked
>so high? "404 error" is constant through the year (it may be a link
>from a 404 page on a web server, but I'd still be surprised that it is
>clicked so often), but the other is viewed in bursts (see e.g.
>http://stats.grok.se/en/201012/File:Hardy_boys_cover_09.jpg). Any idea?
As I understand it, the access counts are derived from frontend cache
log files without much of an attempt to filter out abnormalities like
someone putting a stone on their keyboard to keep the F5 key pressed,
which would cause their web browser to load the same page again and
again. More realistically, you have malfunctioning bots that end up
in a loop and request the same page many times, and there are probably
deliberate attempts to push certain topics. Even if you keep it down
and request a page only once per minute, that would still be about
half a million requests per year, quite enough to get into "top" lists
on the smaller Wikipedia versions, for instance. If you look further down
in your list, you'll probably find that Special:Export/* is extremely
popular even though it's a very obscure feature, but apparently some
articles are exported hundreds of thousands of times.
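As a rough check on the once-per-minute figure above, here is a minimal
sketch. The function name is made up for illustration, and a 365-day
year with no downtime is assumed:

```python
def requests_per_year(per_minute: float, days: int = 365) -> int:
    """Total requests from one client that sends `per_minute`
    requests every minute, around the clock, for `days` days."""
    minutes_per_day = 60 * 24
    return round(per_minute * minutes_per_day * days)


# One request per minute, all year long.
print(requests_per_year(1))  # 525600
```

So a single slow but persistent client is already in the "half a million"
range without doing anything that would look like a flood.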
You would need access to additional data, like the IP addresses the
requests come from, or the Referer header of the requests, and so on, to
attempt individual guesses (that data, however, cannot be published for
privacy reasons; I do not know if it is collected at all or in what
form). For the 404 error you could probably verify that the page is #1
for queries about it on various search engines in many locales (if you
assume there are 2 billion Internet users, and 6.6% of them bing the
error and get to the Wikipedia page, that would already explain the number).
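The back-of-the-envelope estimate in the parenthesis can be written out
as follows. This is only a sketch of the arithmetic: the 2 billion users
and the 6.6% share are the assumptions stated above, not measured values,
and the function name is hypothetical:

```python
def estimated_views(internet_users: int, share: float) -> int:
    """Page views expected if `share` of all users each reach the page once."""
    return round(internet_users * share)


# Assumed inputs from the text: 2 billion users, 6.6% of them land on the page.
print(estimated_views(2_000_000_000, 0.066))  # 132000000
```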
So the numbers are rather rough and won't really tell you anything you
did not already know (Sex > Astrobiology, no surprise there), and you
can't really say Steve Jobs > Justin Bieber based on this data without
explaining all the caveats anyway. The data is more useful if you look
for general trends like in http://katograph.appspot.com/ which tells
you, for example, that articles on people in Film are viewed much more
often than articles on people in Sports, which is at least slightly
non-obvious.
Björn Höhrmann · mailto:bjoern at hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/