Today WMF Analytics announces a new product: a daily feed of media file
request counts for all Wikimedia projects [1].
The counts are based on unsampled data, so any single request within the
defined scope [2] will contribute to the counts.
It can be seen as complementary to our page view counts files [5].
The file layout is documented on wikitech [3].
Daily counts have been backfilled from January 1, 2015 onwards.
Additionally, there is a daily zip file containing a small subset of these
raw counts: the top 1000 most requested media files, with one csv file for
each column [7]. As these csv files have headers (not so easy to add in
Hive), you may want to start with this file for a first impression (best
opened in a spreadsheet program).
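
For those who prefer a script to a spreadsheet, here is a minimal Python
sketch for a first look at the daily zip file. It assumes the csv files are
comma-separated and UTF-8 encoded; neither is confirmed here, so check the
layout page [3] first. The function name preview_csvs is made up for
illustration, and no member file names are assumed: the sketch simply walks
whatever csv files it finds in the archive.

```python
import csv
import io
import itertools
import zipfile


def preview_csvs(zip_bytes, max_rows=5):
    """Return {member_name: (header, first_rows)} for each .csv in the zip.

    Assumptions (not confirmed by the announcement): comma-separated
    values, UTF-8 encoding, and a header on the first line of each file.
    """
    previews = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Skip anything in the archive that is not a csv file.
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.reader(text)
                # First line is the header; an empty file yields [].
                header = next(reader, [])
                # Read only the first few data rows, not the whole file.
                rows = list(itertools.islice(reader, max_rows))
            previews[name] = (header, rows)
    return previews
```

To use it, download one daily zip file from [1], read it in binary mode and
pass the bytes to preview_csvs; each entry then gives the column headers and
the first few ranked rows of one csv file.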
The counts are collected from our Hadoop system, using a Hive query, with
data markup done in UDF scripts. This feed hopefully addresses a
long-standing request, expressed often and by many, which we regrettably couldn't
fulfil earlier, as our pre-Hadoop infrastructure and processing capacity
were not up to the task.
An initial draft design (RFC) was presented last November at the Amsterdam
Hackathon 2014 (GLAM and Wikidata).
Online consultation followed, leading to the current design [4].
This is a data feed with production status, but not the final release, as
there is one major issue that hasn't been addressed yet (but progress is
being made):
When using Media Viewer [8] to view images, some images are prefetched for
better user experience, but these may never be shown to the user. Currently,
those prefetched images are getting counted, as there is no way to detect
whether an image was actually shown to the user or not.
Gilles Dubuc and other colleagues worked on a solution that would not hamper
performance (a tough challenge) and would help us discern viewed from
non-viewed files. A few days ago a patch was published! Adaptation of the
Hive query will follow later [6]. Relatedly, context tagging isn't
supported yet [9].
Huge thanks to everyone who has contributed to the process so far, and
continues to do so.
Special thanks to Christian Aistleitner with whom I co-authored the design,
and who also wrote the Hive implementation.
Erik Zachte
[1] http://dumps.wikimedia.org/other/mediacounts/
[2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Filtering
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mediacounts
[4] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
[5] https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites
(a new version of this data feed is in the works)
[6] https://phabricator.wikimedia.org/T89088
[7] Before you ask: no plans yet for further aggregation into monthly or
yearly top ranking files. The current csv files are quick wins, using
standard Linux tools.
[8] https://www.mediawiki.org/wiki/Multimedia/Media_Viewer
[9] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#by_context