Fwd: Per-file view counts - Analytics

23 May 2014


(resending to analytics@)
tl;dr I am hoping the setup for wikipage view counts (based on Varnish UDP
logging) could be reused for thumbnails, and I am asking for advice on that.
Hi all,
while the tracking of user behavior on Wikimedia sites is generally in a
sad state, there is a decent-enough external tool [1] for telling which
Wikipedia articles are popular and how that popularity changes day-to-day.
The same is not true for images, for which there are no public view
statistics at all. As a poor but still-better-than-nothing tool, people
have been looking at file description page view counts to get an estimate
of the level of interest in an image. This will be rendered mostly useless
by MediaViewer (per our logs, only about 2% of readers follow through to
the file page), which makes GLAM people sad.
I am looking into alternative ways for supporting image curation with usage
statistics. I can see four use cases here:
1. track how often an image is seen
2. track how many people are interested in an image
3. track how many people download/reuse an image
4. track how many people are interested in image metadata (e.g. GLAMs like
to know not only how many people have seen the image, but also how many of
them have seen which institution it comes from)
I haven't yet collected much information on how much people actually want
each of these; I would like to understand first which of them are
realistically doable. I'll share my vague plans on how to implement them; I
would appreciate it if you could poke holes in them / suggest better solutions.
== Tracking image view counts (including thumbnails) ==
This seems largely analogous to how page view counts are tracked: in one
case we need to know Varnish html hit counts, in the other Varnish image
hit counts.
For the page view counts, according to [2] Varnishes send a UDP packet to
a statistics server whenever a page is requested; the results are
aggregated in hourly buckets, published at [2], then further aggregated
into daily buckets, put into an SQL database and visualised on a 3rd party
server.
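To make the hourly-to-daily step concrete, here is a minimal sketch of the aggregation. It assumes each line of an hourly dump looks like "<project> <title> <count> <bytes>"; that layout, the function names and the sample data are assumptions for illustration, not a description of the actual pipeline.

```python
from collections import Counter


def parse_line(line):
    """Parse one dump line, assumed to be '<project> <title> <count> <bytes>'."""
    fields = line.split()
    if len(fields) != 4 or not fields[2].isdigit():
        return None  # skip malformed lines rather than guess at their meaning
    project, title, count, _bytes = fields
    return (project, title), int(count)


def daily_totals(hourly_dumps):
    """Sum per-page counts over an iterable of hourly dumps (each an iterable of lines)."""
    totals = Counter()
    for lines in hourly_dumps:
        for line in lines:
            parsed = parse_line(line)
            if parsed is not None:
                key, count = parsed
                totals[key] += count
    return totals


# Made-up sample data standing in for two hourly files.
hour_00 = ["en Main_Page 120 4096", "de Hauptseite 30 2048"]
hour_01 = ["en Main_Page 80 4096", "garbage line"]
totals = daily_totals([hour_00, hour_01])
```

In practice the hourly inputs would be opened dump files rather than in-memory lists; the summing logic stays the same.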
This seems to be easily replicable for thumbnails / image originals:
deduplicate by file name (and possibly some sort of thumbnail size
buckets), save in dump files, ask Henrik to include them in
stats.grok.se (it would probably be easy to hack them into the current
system if they get fake wikidb names). It would be even more reliable
than page view counts because image redirects work differently from
article redirects, and don't split the view counts forever.
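A rough sketch of the "group by file name and size bucket" step might look like the following. The request path layouts, the bucket limits and all function names are assumptions made for illustration (the hash directories in the examples are made up); a real log processor would need to handle more path variants, such as archived versions.

```python
import re
from collections import Counter
from urllib.parse import unquote

# Assumed path layouts (not taken from the text above):
#   original:  /wikipedia/commons/a/ab/Example.jpg
#   thumbnail: /wikipedia/commons/thumb/a/ab/Example.jpg/220px-Example.jpg
PATH_RE = re.compile(
    r"^/(?P<project>[^/]+)/(?P<site>[^/]+)/"
    r"(?P<thumb>thumb/)?[0-9a-f]/[0-9a-f]{2}/(?P<name>[^/]+)"
    r"(?:/(?:[^/]*-)?(?P<width>\d+)px-[^/]+)?$"
)

BUCKETS = (320, 640, 1280)  # hypothetical bucket limits, in pixels


def size_bucket(width):
    """Map a thumbnail width to a coarse bucket label; None means an original."""
    if width is None:
        return "original"
    for limit in BUCKETS:
        if width <= limit:
            return f"<={limit}px"
    return f">{BUCKETS[-1]}px"


def parse_hit(path):
    """Return (site, file name, bucket) for an image request, or None if unrecognised."""
    match = PATH_RE.match(path)
    if match is None:
        return None
    width = int(match.group("width")) if match.group("width") else None
    return match.group("site"), unquote(match.group("name")), size_bucket(width)


def count_hits(paths):
    """Count requests per (site, file name, bucket), ignoring unrecognised paths."""
    counts = Counter()
    for path in paths:
        key = parse_hit(path)
        if key is not None:
            counts[key] += 1
    return counts


hits = count_hits([
    "/wikipedia/commons/thumb/a/ab/Example.jpg/220px-Example.jpg",
    "/wikipedia/commons/thumb/a/ab/Example.jpg/300px-Example.jpg",
    "/wikipedia/commons/a/ab/Example.jpg",
    "/wikipedia/en/thumb/1/12/Some%20File.png/800px-Some%20File.png",
    "/favicon.ico",
])
```

Keying on the file name rather than the full thumbnail URL is what merges the 220px and 300px requests into one bucket here.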
The big drawback is caching: while HTML pages are not cached (in the sense
that the browser always sends a request for them), most images are [3]. I
don't think this could be realistically helped (in theory it could be
possible to calculate from the page view stats and the imagelinks table the
exact number of times an image has been displayed, but doing that on the
scale of Wikipedia pageviews is not plausible), and it might even be
considered a good thing: instead of view counts, we get a reasonable
approximation of unique visitors (image viewers).
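For reference, the "in theory" calculation amounts to a join between page view counts and the imagelinks table. The sketch below is a toy version under strong assumptions: every view of a page counts as one display of each image linked from it, and the inputs fit in memory. The names and sample data are hypothetical.

```python
from collections import Counter


def estimated_displays(page_views, image_links):
    """Estimate displays per image from page views and (page, image) link pairs.

    Assumes each page view displays every image linked from that page exactly once.
    """
    displays = Counter()
    for page, image in image_links:
        displays[image] += page_views.get(page, 0)
    return displays


# Made-up sample data.
page_views = {"Page_A": 1000, "Page_B": 250}
image_links = [
    ("Page_A", "Example.jpg"),
    ("Page_B", "Example.jpg"),
    ("Page_B", "Other.png"),
    ("Page_C", "Other.png"),  # no recorded views for this page
]
displays = estimated_displays(page_views, image_links)
```

The logic is trivial; the difficulty mentioned above is running this join over the full volume of page views and image links.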
== Tracking people "expressing interest" in a file (whatever that means) ==
Knowing how many people see the thumbnail of a file is important, but -
unlike article view counts - it does not really say how many people show
any interest in an image. It might just happen to be included in an article
they are interested in; maybe they never scrolled to the bottom of the
article and haven't seen the actual image at all. So it is useful to know
how many people clicked through the image.
Without MediaViewer, file page view counts can be (and are) used for this
(even if they have their own problems for Commons images); with MediaViewer
there would have to be a way to tell apart a thumbnail that was requested
for use in MediaViewer vs. for some other reason. This could be done
similarly to how the ?download query parameters are handled: append a
source=mediaviewer URL parameter to the filename, have Varnish rewrite it
to avoid cache splitting, add the parameter to the UDP packet and filter on
it when processing the logs. This would be a fairly generic mechanism which
could be potentially used for other things (source=filepage,
source=hovercards etc); it would still split the cache on the browser side,
but given that the large thumbnails used by MediaViewer don't overlap much
with the thumbnail sizes used on wiki pages, this does not seem tragic.
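The rewrite-and-log idea can be illustrated outside Varnish as a small function that separates the tracking parameter from the URL used as the cache key. This is only a sketch of the logic in Python; an actual implementation would live in the Varnish configuration and the log processor, and the function name and example host are hypothetical.

```python
from collections import Counter
from urllib.parse import unquote, urlsplit, urlunsplit

TRACKING_PARAM = "source"  # parameter name proposed above


def split_tracking(url):
    """Return (cache_url, source): the URL without the tracking parameter,
    and the parameter's value (None if the parameter is absent)."""
    parts = urlsplit(url)
    source = None
    kept = []
    for item in parts.query.split("&") if parts.query else []:
        key, _, value = item.partition("=")
        if key == TRACKING_PARAM:
            source = unquote(value)
        else:
            kept.append(item)  # keep other parameters verbatim, e.g. a bare "download"
    return urlunsplit(parts._replace(query="&".join(kept))), source


def count_by_source(urls):
    """Count requests per (cache_url, source) pair, as a log processor might."""
    return Counter(split_tracking(url) for url in urls)


base = "https://upload.example.org/thumb/a/ab/Example.jpg/1280px-Example.jpg"
counts = count_by_source([
    base + "?source=mediaviewer",
    base + "?source=mediaviewer",
    base,
    base + "?download&source=filepage",
])
```

All four requests share one cached object per distinct cache_url, while the source value survives separately for filtering in the logs.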
== Tracking the number of downloads/reuses ==
MediaViewer adds a ?download parameter to download links (to ensure that
the Content-Disposition header is set), which could be used to track
downloads. (It would miss downloads from other sources; I'm not sure there
is a generic solution. Maybe something based on referrer or some other
header, in case the browsers set those differently for downloaded and
displayed images?)
Reuse is too vague a category to say anything about. Tracking views of
an image which originate outside the Wikimedia universe seems possible, but
too much effort and too different from the previous methods to be worth
spending time on now.
== Tracking the interest in image metadata ==
MediaViewer collects global statistics on the ratio of people viewing an
image vs. following through to the file page vs. scrolling down to open the
panel with file metadata information. I don't think per-file tracking of
this would be useful or worth the effort; if it is needed, it would have to
be done by some EventLogging-ish setup with MediaViewer creating tracking
gifs whenever the metadata information is opened.
What do you think, is any of this plausible?
[1] http://stats.grok.se/
[2] http://dumps.wikimedia.org/other/pagecounts-raw/
[3] looking at the effects of a page load in the network tab, it seems that
most thumbnails on a page are loaded from the browser cache, while there
are a few which are requested from the servers which respond with a 304.
Which images do that is deterministic but otherwise seems totally random;
e.g. on the current enwiki mainpage, RalphBakshiJan09.jpg is loaded from
cache while Prayuth_Jan-ocha_2010-06-17_ITN.jpg is always requested. I am
curious about the reason for this.