Re: [Analytics] [Announce] New daily feed: media file request counts

25 Mar 2015

Awesome, Hay, thanks!

-----Original Message-----
From: analytics-bounces(a)lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org]
On Behalf Of Hay (Husky)
Sent: Wednesday, March 25, 2015 11:03
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in
Wikipedia and analytics.
Subject: Re: [Analytics] [Announce] New daily feed: media file request counts

Answering my own question: until somebody puts up a stats.grok.se-like interface for the
mediacounts, I've hacked together a Python script that can be used to 'query'
the TSV files for a single file or a list of files:

https://github.com/hay/wiki-tools/blob/master/etc/mediacounts-stats.py
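
For readers who only want the general idea, here is a minimal sketch of this kind of lookup. It is not the linked script. It assumes that each row's first tab-separated column holds the media file's path and that a daily dump is bz2-compressed; both should be checked against the layout documentation on wikitech. The function names and the sample rows are hypothetical.

```python
import bz2
import io


def lookup_rows(lines, wanted):
    """Return {first_column: remaining_columns} for rows whose first column is in `wanted`.

    Assumption: the first tab-separated column identifies the media file.
    """
    wanted = set(wanted)
    found = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] in wanted:
            found[fields[0]] = fields[1:]
    return found


def lookup_in_dump(path, wanted):
    """Stream a bz2-compressed TSV from disk so the whole file never sits in memory."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return lookup_rows(handle, wanted)


if __name__ == "__main__":
    # Made-up rows, only to show the call; real dumps have more columns.
    sample = io.StringIO(
        "/wikipedia/commons/a/ab/Example.jpg\t1000\t12\n"
        "/wikipedia/commons/c/cd/Other.png\t2000\t7\n"
    )
    print(lookup_rows(sample, ["/wikipedia/commons/a/ab/Example.jpg"]))
```

Reading line by line avoids loading a whole daily file at once, which is what tends to crash a spreadsheet.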

-- Hay

On Wed, Mar 25, 2015 at 8:05 AM, Maarten Brinkerink <wmnl(a)maartenbrinkerink.net>
wrote:
...
  Dear Erik,

 Thanks for pointing to this nice development! Since I’m not so 
 technical, I was wondering to what extent this development helps us 
 reach the vision and requirements that have been described by Maarten 
 Zeinstra as part of his research for the GW Toolset project 
 (https://commons.wikimedia.org/wiki/Commons:GLAMwiki_Toolset_Project)?

 See:
 https://commons.wikimedia.org/wiki/File:Report_on_requirements_for_usage_and_reuse_statistics_for_GLAM_content.pdf

 Best,

 Maarten

 On 24 Mar 2015, at 20:47, Jane Darnell <jane023(a)gmail.com> wrote:

 +1 - I just crashed my spreadsheet trying to open one .tsv file. But 
 great news indeed, Erik - this is an important first step!

 On Tue, Mar 24, 2015 at 8:42 PM, Hay (Husky) <huskyr(a)gmail.com> wrote:

 Awesome! I'm especially glad that more statistics than 'just' the 
 image views are included, like the aggregated views for thumbnails, 
 and the media files as well. I just hope somebody will build a tool 
 in the near future like stats.grok.se so we can view statistics for 
 individual files and/or sets of files a la Bagalama2.

 -- Hay

 On Tue, Mar 24, 2015 at 6:39 PM, Erik Zachte <ezachte(a)wikimedia.org>
 wrote:
 Today WMF Analytics announces a new product: a daily feed of media 
 file request counts for all Wikimedia projects [1].

 The counts are based on unsampled data, so any single request 
 within the defined scope [2] will contribute to the counts.

 It can be seen as complementary to our page view counts files [5].

 The file layout is documented on wikitech [3].

 Daily counts have been backfilled from January 1, 2015 onwards.

 Additionally there is a daily zip file which contains a small 
 subset of these raw counts: top 1000 most requested media files, 
 one csv file for each column [7]. As these csv files have headers 
 (not so easy to add in Hive) you may want to start with this file 
 for a first impression (best opened in a spreadsheet program).
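
For those who prefer a script to a spreadsheet, the following is a rough sketch of how the zip file could be previewed. It assumes the archive's members are comma-separated, UTF-8 encoded files with one header row and names ending in .csv; the function name is hypothetical, and the member names are simply whatever the archive contains.

```python
import csv
import io
import zipfile


def preview_csvs(zip_source, max_rows=5):
    """Return {member_name: (header, first_rows)} for every .csv member of the archive.

    `zip_source` may be a path or a binary file-like object.
    """
    previews = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.reader(text)
                header = next(reader, [])  # first row is assumed to be the header
                rows = [row for _, row in zip(range(max_rows), reader)]
            previews[name] = (header, rows)
    return previews
```

Calling preview_csvs on a downloaded zip and printing the result shows each file's column names and its first few rows, which is enough for a first impression.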

 The counts are collected from our Hadoop system, using a Hive 
 query, with data markup done in UDF scripts. This feed hopefully 
 addresses a long standing request, expressed often and by many, 
 which we regrettably couldn't fulfil earlier, as our pre-Hadoop 
 infrastructure and processing capacity were not up to the task.

 An initial draft design (RFC) was presented last November at the 
 Amsterdam Hackathon 2014 (GLAM and Wikidata).

 Online consultation followed, leading to the current design [4].

 This is a data feed with production status, but not the final 
 release, as there is one major issue that hasn't been addressed yet 
 (but progress is being made):

 When using Media Viewer [8] to view images, some images are prefetched 
 for a better user experience, but these may never be shown to the user.
 Currently, those prefetched images are counted, as there is no way to 
 detect whether an image was actually shown to the user or not.

 Gilles Dubuc and other colleagues worked on a solution that would 
 not hamper performance (a tough challenge) and would help us 
 discern viewed from non-viewed files. A few days ago a patch was 
 published! Adaptation of the Hive query will follow later. [6] 
 Also, and related, context tagging isn't supported yet. [9]

 Huge thanks to all people who contributed to the process so far, 
 and still do.

 Special thanks to Christian Aistleitner with whom I co-authored the 
 design, and who also wrote the Hive implementation.

 Erik Zachte

 [1] http://dumps.wikimedia.org/other/mediacounts/

 [2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Filtering

 [3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mediacounts

 [4] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts

 [5] https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites

       (a new version of this data feed is in the works)

 [6] https://phabricator.wikimedia.org/T89088

 [7] Before you ask: no plans yet for further aggregation into 
 monthly or yearly top ranking files. The current csv files are 
 quick wins, using standard Linux tools.

 [8] https://www.mediawiki.org/wiki/Multimedia/Media_Viewer

 [9] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#by_context

 _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics

