(resending to analytics@)

tl;dr I am hoping the setup for wikipage view counts (based on Varnish UDP logging) could be reused for thumbnails, and I am asking for advice on that.


Hi all,

while the tracking of user behavior on Wikimedia sites is generally in a sad state, there is a decent-enough external tool [1] for telling which Wikipedia articles are popular and how that popularity changes day-to-day. The same is not true for images, for which there are no public view statistics at all. As a poor but still-better-than-nothing tool, people have been looking at file description page view counts to get an estimate of the level of interest in an image. This will be rendered mostly useless by MediaViewer (per our logs, only about 2% of readers follow through to the file page), which makes GLAM people sad.

I am looking into alternative ways of supporting image curation with usage statistics. I can see four use cases here:
1. track how often an image is seen
2. track how many people are interested in an image
3. track how many people download/reuse an image
4. track how many people are interested in image metadata (e.g. GLAMs like to know not only how many people have seen the image, but also how many of them have seen which institution it comes from)

I haven't yet collected much information on how much people actually want each of these; I would like to understand first which of them are realistically doable. I'll share my vague plans on how to implement them; I would appreciate it if you could poke holes in them / suggest better solutions.


== Tracking image view counts (including thumbnails) ==

This seems largely analogous to how page view counts are tracked: in one case we need to know Varnish html hit counts, in the other Varnish image hit counts.

For the page view counts, according to [2] Varnishes send a UDP packet to a statistics server whenever a page is requested; the results are aggregated in hourly buckets, published at [2], then further aggregated into daily buckets, put into an SQL database and visualised on a 3rd party server.

This seems to be easily replicable for thumbnails / image originals: deduplicate by file name (and possibly some sort of thumbnail size buckets), save in dump files, ask Henrik to include them in stats.grok.se (it would probably be easy to hack them into the current system if they get fake wikidb names). It would be even more reliable than page view counts because image redirects work differently from article redirects, and don't split the view counts forever.
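
A minimal sketch of the aggregation step, to make the idea concrete. It assumes the usual upload URL layout (/wikipedia/commons/thumb/a/ab/Foo.jpg/220px-Foo.jpg for thumbnails, /wikipedia/commons/a/ab/Foo.jpg for originals) and that the log processor already yields (hour, path) pairs; the bucket widths and all function names are made up for illustration and are not part of any existing setup.

```python
import re
from collections import Counter
from urllib.parse import unquote

# Assumed URL layouts; the optional prefix covers names like "page1-220px-...".
THUMB_RE = re.compile(
    r"^/[^/]+/[^/]+/thumb/[0-9a-f]/[0-9a-f]{2}/([^/]+)/(?:[^/]*?-)?(\d+)px-[^/]+$"
)
ORIG_RE = re.compile(r"^/[^/]+/[^/]+/[0-9a-f]/[0-9a-f]{2}/([^/]+)$")

# Hypothetical size buckets (upper bounds in pixels).
BUCKETS = (320, 640, 1024, 2048)


def bucket(width):
    """Map a thumbnail width to the label of the smallest bucket that fits it."""
    for upper in BUCKETS:
        if width <= upper:
            return "<=%dpx" % upper
    return ">%dpx" % BUCKETS[-1]


def classify(path):
    """Return (file name, size label) for an image request, or None otherwise."""
    m = THUMB_RE.match(path)
    if m:
        return unquote(m.group(1)), bucket(int(m.group(2)))
    m = ORIG_RE.match(path)
    if m:
        return unquote(m.group(1)), "original"
    return None


def aggregate(requests):
    """Count requests per (hour, file name, size label)."""
    counts = Counter()
    for hour, path in requests:
        key = classify(path)
        if key is not None:
            counts[(hour,) + key] += 1
    return counts


sample = [
    ("2014-05-01T10", "/wikipedia/commons/thumb/a/ab/Foo.jpg/220px-Foo.jpg"),
    ("2014-05-01T10", "/wikipedia/commons/thumb/a/ab/Foo.jpg/300px-Foo.jpg"),
    ("2014-05-01T10", "/wikipedia/commons/a/ab/Foo.jpg"),
    ("2014-05-01T10", "/favicon.ico"),
]
hourly = aggregate(sample)
```

Keying on the file name taken from the directory segment rather than on the full thumbnail name is what folds all sizes of one file into a single series.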

The big drawback is caching: while HTML pages are not cached (in the sense that the browser always sends a request for them), most images are [3]. I don't think this could be realistically helped (in theory it could be possible to calculate from the page view stats and the imagelinks table the exact number of times an image has been displayed, but doing that on the scale of Wikipedia pageviews is not plausible), and it might even be considered a good thing: instead of view counts, we get a reasonable approximation of unique visitors (image viewers).
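
For reference, the theoretical calculation mentioned above is just a join of page view counts with imagelinks rows. A toy sketch, assuming both fit in memory (which is exactly what does not hold at Wikipedia scale); the function name and data shapes are made up for illustration:

```python
from collections import Counter


def estimate_image_displays(page_views, image_links):
    """Sum the views of every page that embeds an image.

    page_views:  dict mapping page title -> view count
    image_links: iterable of (page title, image name) pairs,
                 in the spirit of imagelinks table rows
    """
    displays = Counter()
    for page, image in image_links:
        displays[image] += page_views.get(page, 0)
    return displays


views = {"A": 100, "B": 5}
links = [("A", "Foo.jpg"), ("B", "Foo.jpg"), ("A", "Bar.png")]
estimate = estimate_image_displays(views, links)
```

This counts every page view as a display of every image on the page, whether or not the reader scrolled far enough to see it.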


== Tracking people "expressing interest" in a file (whatever that means) ==

Knowing how many people see the thumbnail of a file is important, but - unlike article view counts - it does not really say how many people show any interest in an image. It might just happen to be included in an article they are interested in; maybe they never scrolled to the bottom of the article and haven't seen the actual image at all. So it is useful to know how many people clicked through the image.

Without MediaViewer, file page view counts can be (and are) used for this (even if they have their own problems for Commons images); with MediaViewer there would have to be a way to tell apart a thumbnail that was requested for use in MediaViewer vs. for some other reason. This could be done similarly to how the ?download query parameters are handled: append a source=mediaviewer URL parameter to the filename, have Varnish rewrite it to avoid cache splitting, add the parameter to the UDP packet and filter on it when processing the logs. This would be a fairly generic mechanism which could be potentially used for other things (source=filepage, source=hovercards etc); it would still split the cache on the browser side, but given that the large thumbnails used by MediaViewer don't overlap much with the thumbnail sizes used on wiki pages, this does not seem tragic.
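
To illustrate the rewrite step: the source parameter is split off from the URL, the remainder is used as the cache key, and the parameter value goes into the log record. In practice this would live in the Varnish configuration; the Python below is only a sketch of the logic, and the function name is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def split_source(url):
    """Return (url without the source parameter, source value or None)."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    source = next((v for k, v in query if k == "source"), None)
    if source is None:
        # Nothing to strip: leave the URL byte-for-byte unchanged.
        return url, None
    remaining = [(k, v) for k, v in query if k != "source"]
    cache_url = urlunsplit(parts._replace(query=urlencode(remaining)))
    return cache_url, source


cache_url, source = split_source(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Foo.jpg/"
    "1024px-Foo.jpg?source=mediaviewer"
)
```

Here cache_url is the plain thumbnail URL, so requests with and without the parameter share one cache entry, while source ("mediaviewer") is what the log filter would key on.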


== Tracking the number of downloads/reuses ==

MediaViewer adds a ?download parameter to download links (to ensure that the Content-Disposition header is set), which could be used to track downloads. (It would miss downloads from other sources; I'm not sure there is a generic solution. Maybe something based on referrer or some other header, in case the browsers set those differently for downloaded and displayed images?)

Reuse is too vague a category to say anything about. Tracking views of an image which originate outside the Wikimedia universe seems possible, but it is too much effort and too different from the previous methods to be worth spending time on now.


== Tracking the interest in image metadata ==

MediaViewer collects global statistics on the ratio of people viewing an image vs. following through to the file page vs. scrolling down to open the panel with file metadata information. I don't think a per-file tracking of this would be useful or worth the effort; if it is needed, it would have to be done by some EventLogging-ish setup with MediaViewer creating tracking gifs whenever the metadata information is opened.


What do you think, is any of this plausible?


[1] http://stats.grok.se/
[2] http://dumps.wikimedia.org/other/pagecounts-raw/
[3] looking at the effects of a page load in the network tab, it seems that most thumbnails on a page are loaded from the browser cache, while there are a few which are requested from the servers which respond with a 304. Which images do that is deterministic but otherwise seems totally random; e.g. on the current enwiki mainpage, RalphBakshiJan09.jpg is loaded from cache while Prayuth_Jan-ocha_2010-06-17_ITN.jpg is always requested. I am curious about the reason for this.