Re: [Analytics] [Announce] New daily feed: media file request counts

24 Mar 2015

+1 - I just crashed my spreadsheet trying to open one .tsv file. But great
news indeed Erik - this is an important first step!

On Tue, Mar 24, 2015 at 8:42 PM, Hay (Husky) <huskyr(a)gmail.com> wrote:

...
 Awesome! I'm especially glad that more statistics than 'just' the
 image views are included, like the aggregated views for thumbnails,
 and the media files as well. I just hope somebody will build a tool in
 the near future like stats.grok.se so we can view statistics for
 individual files and/or sets of files a la BaGLAMa2.

 -- Hay

 On Tue, Mar 24, 2015 at 6:39 PM, Erik Zachte <ezachte(a)wikimedia.org>
 wrote:
 Today WMF Analytics announces a new product: a daily feed of media file
 request counts for all Wikimedia projects [1].

 The counts are based on unsampled data, so any single request within the
 defined scope [2] will contribute to the counts.

 It can be seen as complementary to our page view count files [5].

 The file layout is documented on wikitech [3].

 Daily counts have been backfilled from January 1, 2015 onwards.

 Additionally there is a daily zip file which contains a small subset of
 these raw counts: top 1000 most requested media files, one csv file for
 each column [7]. As these csv files have headers (not so easy to add in
 Hive) you may want to start with this file for a first impression (best
 opened in a spreadsheet program).
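
 For readers whose spreadsheet cannot load a full daily file, the same
 kind of top-N extract can be approximated by streaming the file row by
 row. The following is a minimal sketch, not the method used to build the
 official csv files. It assumes a tab-separated file in which one column
 holds the media file name and another holds an integer request count; the
 column positions (name_col, count_col) and the function name are
 hypothetical, so check the layout documented on wikitech [3] before use.

```python
import csv
import heapq
import io


def top_requested(lines, n=1000, name_col=0, count_col=1):
    """Return the n (count, name) pairs with the largest counts.

    `lines` is any iterable of text lines (an open file works), so the
    input is never loaded into memory as a whole. The column positions
    are assumptions, not the documented mediacounts layout.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    needed = max(name_col, count_col)
    rows = (
        (int(row[count_col]), row[name_col])
        for row in reader
        # Skip short rows and rows whose count is not a plain integer.
        if len(row) > needed and row[count_col].isdigit()
    )
    # nlargest keeps only n candidates in memory at any time.
    return heapq.nlargest(n, rows)


# Tiny in-memory sample standing in for a real daily file.
sample = io.StringIO("a.jpg\t5\nb.png\t12\nc.svg\t7\n")
for count, name in top_requested(sample, n=2):
    print(count, name)
```

 For a real file, pass an open file object instead of the in-memory
 sample, for example open(path, encoding="utf-8", newline="").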

 The counts are collected from our Hadoop system, using a Hive query, with
 data markup done in UDF scripts. This feed hopefully addresses a
 long-standing request, expressed often and by many, which we regrettably
 couldn't fulfil earlier, as our pre-Hadoop infrastructure and processing
 capacity were not up to the task.

 An initial draft design (RFC) was presented last November at the
 Amsterdam Hackathon 2014 (GLAM and Wikidata).

 Online consultation followed, leading to the current design [4].

 This is a data feed with production status, but not the final release, as
 there is one major issue that hasn't been addressed yet (but progress is
 being made):

 When using Media Viewer to view images, some images are prefetched for
 better user experience, but these may never be shown to the user.
 Currently, those prefetched images are getting counted, as there is no
 way to detect whether an image was actually shown to the user or not.

 Gilles Dubuc and other colleagues worked on a solution that would not
 hamper performance (a tough challenge) and would help us discern viewed
 from non-viewed files. A few days ago a patch was published! Adaptation
 of the Hive query will follow later. [6] Also, and related, context
 tagging isn't supported yet. [9]

 Huge thanks to all people who contributed to the process so far, and
 still do.

 Special thanks to Christian Aistleitner with whom I co-authored the
 design, and who also wrote the Hive implementation.

 Erik Zachte

 [1] http://dumps.wikimedia.org/other/mediacounts/

 [2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_coun…

 [3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mediacounts

 [4] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_coun…

 [5] https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites
     (a new version of this data feed is in the works)

 [6] https://phabricator.wikimedia.org/T89088

 [7] Before you ask: no plans yet for further aggregation into monthly or
 yearly top ranking files. The current csv files are quick wins, using
 standard Linux tools.

 [8] https://www.mediawiki.org/wiki/Multimedia/Media_Viewer

 [9] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_coun…

 _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics

