Dear Maarten,
Thanks for sending me this report. I had not seen it yet. No one mentioned it during our
RFC phase, possibly assuming it was common knowledge, but it was new to me. It provides
good extra context on where these files are relevant for GLAM.
With these new data files some points on Maarten Zeinstra's substantial list of
functional requirements can be addressed.
7.1.4 "Requests counts on object level": done! (BTW most desired aggregation
level)
Frequency is daily. As said, no plans for monthly aggregation yet (which is most
requested, according to the survey), but that can be done in post-processing. Hourly
stats are not possible (few requests for this anyway).
7.2.1 "Fileview": done!
From the report: "The total number of impressions is the simplest statistic that can be
run by the analytics team."
That may be true, but a quick look at our RFC document [4] (and its talk page, and the long
threads on Phabricator) will show you that even this wasn't easy at all, and it isn't
complete yet. Our ambition to make these new files relevant in different contexts, and
thus to break the counts down into many columns, of course contributed to the overall
size of the task.
7.2.3 "FilePlay": done!
These files also address:
8.3.1. "Kraken needs to be able to canonicalise different versions (thumbs, stills,
etc.) of files into one identity (File)"
8.4 "Kraken needs to be able to remain online even when it needs to do millions of
comparisons on every hit that the Wikimedia Commons gets."
These files provide partial input for:
8.1.3 "Give me a monthly overview of the views my media files had (FileViews)"
8.1.5 "Give me a monthly overview of the number of plays my audio/video had
(FilePlays)"
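As a rough illustration of the post-processing mentioned under 7.1.4, daily counts could be summed into a monthly overview along these lines. This is only a sketch under assumptions: the function name is made up for this example, and the column positions are parameters because the real layout should be taken from the documentation on wikitech [3] rather than from this snippet.

```python
import csv
from collections import Counter
from typing import Iterable


def sum_daily_counts(daily_files: Iterable[Iterable[str]],
                     key_col: int, count_col: int) -> Counter:
    """Sum one count column over several daily TSV files, per media file.

    key_col and count_col are zero-based column positions. They are
    assumptions of this sketch; look up the real positions in the
    documented file layout before using it.
    """
    totals: Counter = Counter()
    for lines in daily_files:
        # csv.reader consumes the input line by line, so a large daily
        # file is never held in memory as a whole.
        for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) <= max(key_col, count_col):
                continue  # skip short or malformed rows
            try:
                totals[row[key_col]] += int(row[count_col])
            except ValueError:
                continue  # skip rows whose count is not an integer
    return totals


# Tiny in-memory example with made-up file names and counts.
day1 = ["/a.jpg\t3", "/b.ogv\t1"]
day2 = ["/a.jpg\t2"]
monthly = sum_daily_counts([day1, day2], key_col=0, count_col=1)
```

For real data, each element of daily_files would be an open text-mode file handle for one day's file. The running totals still hold one entry per distinct media file, so a full month over all projects may need a machine with plenty of memory or a pre-filter on the files of interest.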
------------
Sections 7.1.3 and 7.1.5 in particular list other requirements, which are mostly out of
scope for these data files.
These files are not about where media files are embedded, and aggregation of media
files, e.g. by category, is not covered either.
As the report already mentions (in 6. "Current status"), some of that is covered by
existing data files and existing tools, even though some of those data and tools need
renovation.
Summing up, these files comprise a basic building block and will hopefully be complemented
with other data files in the future.
I say hopefully, as I don't know about plans or commitments in this area.
Cheers,
Erik
From: analytics-bounces(a)lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org]
On Behalf Of Maarten Brinkerink
Sent: Wednesday, March 25, 2015 8:06
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in
Wikipedia and analytics.
Subject: Re: [Analytics] [Announce] New daily feed: media file request counts
Dear Erik,
Thanks for pointing to this nice development! Since I’m not so technical, I was wondering
to what extent this development helps us reach the vision and requirements that have been
described by Maarten Zeinstra as part of his research for the GW Toolset project
(https://commons.wikimedia.org/wiki/Commons:GLAMwiki_Toolset_Project)?
See:
https://commons.wikimedia.org/wiki/File:Report_on_requirements_for_usage_an…
Best,
Maarten
On 24 Mar 2015, at 20:47, Jane Darnell <jane023(a)gmail.com> wrote:
+1 - I just crashed my spreadsheet trying to open one .tsv file. But great news indeed
Erik - this is an important first step!
On Tue, Mar 24, 2015 at 8:42 PM, Hay (Husky) <huskyr(a)gmail.com> wrote:
Awesome! I'm especially glad that more statistics than 'just' the
image views are included, like the aggregated views for thumbnails,
and the media files as well. I just hope somebody will build a tool in
the near future like stats.grok.se so we can view statistics for
individual files and/or sets of files à la BaGLAMa 2.
-- Hay
On Tue, Mar 24, 2015 at 6:39 PM, Erik Zachte <ezachte(a)wikimedia.org> wrote:
Today WMF Analytics announces a new product: a daily feed of media file
request counts for all Wikimedia projects [1].
The counts are based on unsampled data, so any single request within the
defined scope [2] will contribute to the counts.
It can be seen as complementary to our page view counts files [5].
The file layout is documented on wikitech [3].
Daily counts have been backfilled from January 1, 2015 onwards.
Additionally there is a daily zip file which contains a small subset of
these raw counts: top 1000 most requested media files, one csv file for each
column [7]. As these csv files have headers (not so easy to add in Hive) you
may want to start with this file for a first impression (best opened in a
spreadsheet program).
The counts are collected from our Hadoop system, using a Hive query, with
data markup done in UDF scripts. This feed hopefully addresses a long
standing request, expressed often and by many, which we regrettably couldn't
fulfil earlier, as our pre-Hadoop infrastructure and processing capacity
were not up to the task.
An initial draft design (RFC) was presented last November at the Amsterdam
Hackathon 2014 (GLAM and Wikidata).
Online consultation followed, leading to the current design [4].
This is a data feed with production status, but not the final release, as
there is one major issue that hasn't been addressed yet (but progress is
being made):
When using Media Viewer [8] to view images, some images are prefetched for
better user experience, but these may never be shown to the user. Currently,
those prefetched images are getting counted, as there is no way to detect
whether an image was actually shown to the user or not.
Gilles Dubuc and other colleagues worked on a solution that would not hamper
performance (a tough challenge) and would help us discern viewed from
non-viewed files. A few days ago a patch was published! Adaptation of the
Hive query will follow later. [6] Also, and related, context tagging isn't
supported yet. [9]
Huge thanks to all people who contributed to the process so far, and still
do.
Special thanks to Christian Aistleitner with whom I co-authored the design,
and who also wrote the Hive implementation.
Erik Zachte
[1] http://dumps.wikimedia.org/other/mediacounts/
[2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_coun…
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mediacounts
[4] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_coun…
[5] https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites
(a new version of this data feed is in the works)
[6] https://phabricator.wikimedia.org/T89088
[7] Before you ask: no plans yet for further aggregation into monthly or
yearly top ranking files. The current csv files are quick wins, using
standard Linux tools.
[8] https://www.mediawiki.org/wiki/Multimedia/Media_Viewer
[9] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_coun…
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics