[Foundation-l] Questions about most viewed articles in 2011

Thu Jan 12 08:38:53 UTC 2012

On 11.01.2012 22:04, Bjoern Hoehrmann wrote:
> * Frédéric Schütz wrote:
>> So, why would "404 error" and "File:Hardy_boys_cover_09.jpg" be ranked
>> so high? "404 error" is constant through the year (it may be a link
>> from a 404 page on a web server, but I'd still be surprised that it is
>> clicked so often), but the other is viewed in bursts (see e.g.
>> http://stats.grok.se/en/201012/File:Hardy_boys_cover_09.jpg). Any idea?
>
> As I understand it, the access counts are derived from frontend cache
> log files without much of an attempt to filter out abnormalities like
> someone putting a stone on their keyboard to keep the F5 key pressed
> down, which would cause their web browser to load the same page again
> and again.

Yes, that's right.

> More realistically, you have malfunctioning bots that end up
> in a loop causing them to request the same page many times, and there
> probably are deliberate attempts to push certain topics (even if you
> keep it down and request only once per minute, that would still be
> half a million per year, quite enough to get into "top" lists on the
> smaller Wikipedia versions, for instance).
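
The "half a million per year" figure quoted above can be checked with a
quick calculation. This is only a sketch of the arithmetic, assuming a
single client that requests one page exactly once per minute for a full
non-leap year; the variable names are made up for the illustration.

```python
# Requests generated by one client reloading a page once per minute
# for a full (non-leap) year. All names here are illustrative only.
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365

requests_per_year = MINUTES_PER_HOUR * HOURS_PER_DAY * DAYS_PER_YEAR
print(requests_per_year)  # 525600, i.e. roughly half a million
```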

I am still surprised by the 404 page, which is accessed a lot, but all 
through the year. The image is strange as well, with bursts here and 
there (not just many accesses once, and nothing else afterwards).

> If you look further down
> in your list, you'll probably find that Special:Export/* is extremely
> popular even though it's a very obscure feature, but apparently some
> articles are exported hundreds of thousands of times.

Yes, there is plenty of that.

> You would need access to additional data, like the IP addresses from
> where the requests come, or Referer header in requests, and so on, to
> attempt individual guesses (that data however cannot be published for
> privacy reasons, I do not know if it is collected at all or in what
> form).

As far as I can remember, IP addresses are geolocated, and only the
broad location is kept (and that is, or was, done for only one request
out of one thousand).

> So the numbers are rather rough and won't really tell you anything you
> did not already know (Sex > Astrobiology, no surprise there), and you
> can't really say Steve Jobs > Justin Bieber based on this data without
> explaining all the caveats anyway. The data is more useful if you look
> for general trends like in http://katograph.appspot.com/ which tells
> you, for example, that articles on people in Film are viewed much more
> often than people in Sports, which is at least slightly non-obvious.

That's another question: what is the easiest way to know which articles
belong to a given broad category?

Categories do not work well: take a top category such as
Category:Sports and go down the category tree; you'll probably get most
of the Wikipedia pages (at least, you'll definitely get many pages
that have nothing to do with sports). WikiProjects seem better, but
there are many of them (including sub-projects, etc.), so it is not easy
to automate.
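
To illustrate the category problem, here is a minimal sketch of walking
down a category tree with a depth limit. The category and page names
are invented toy data, not the real Wikipedia category graph, and
pages_under is a hypothetical helper; a real run would first have to
load the actual category links.

```python
from collections import deque

# Toy category graph, invented for illustration only.
SUBCATEGORIES = {
    "Sports": ["Football", "Sports culture"],
    "Football": ["Football clubs"],
    "Sports culture": ["Sports films"],
    "Sports films": ["Film actors"],
}
PAGES = {
    "Sports": ["Sport"],
    "Football": ["Association football"],
    "Football clubs": ["Some football club"],
    "Sports culture": ["Sportsmanship"],
    "Sports films": ["Some sports film"],
    "Film actors": ["Some actor"],
}

def pages_under(root, max_depth):
    """Collect pages reachable from root within max_depth subcategory hops."""
    seen = {root}                # categories already queued (guards against cycles)
    queue = deque([(root, 0)])   # breadth-first traversal: (category, depth)
    pages = set()
    while queue:
        category, depth = queue.popleft()
        pages.update(PAGES.get(category, []))
        if depth < max_depth:
            for child in SUBCATEGORIES.get(category, []):
                if child not in seen:
                    seen.add(child)
                    queue.append((child, depth + 1))
    return pages

for depth in range(4):
    print(depth, sorted(pages_under("Sports", depth)))
```

In this toy graph, the page "Some actor" only shows up once the depth
limit reaches 3, which mimics how unrelated pages creep in the further
down the tree one goes. A depth limit reduces the problem but does not
remove it.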

Frédéric