[Announce] New daily feed: media file request counts

Erik Zachte

24 Mar 2015

1:39 p.m.

Today WMF Analytics announces a new product: a daily feed of media file request counts for all Wikimedia projects [1].

The counts are based on unsampled data, so any single request within the defined scope [2] will contribute to the counts.

It can be seen as complementary to our page view counts files [5].

The file layout is documented on wikitech [3].

Daily counts have been backfilled from January 1, 2015 onwards.

Additionally there is a daily zip file which contains a small subset of these raw counts: the top 1000 most requested media files, one csv file for each column [7]. As these csv files have headers (not so easy to add in Hive) you may want to start with this file for a first impression (best opened in a spreadsheet program).
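For readers who would rather script that first look than use a spreadsheet, here is a minimal Python sketch. It assumes only what is stated above (a zip archive holding csv files that each start with a header row); the function name and the archive path in the usage note are hypothetical, and the actual column names should be taken from the headers themselves.

```python
import csv
import io
import zipfile


def preview_top_files(zip_source, rows=5):
    """Return {member_name: (header, first_rows)} for each csv in the archive.

    `zip_source` may be a path or a binary file-like object.
    """
    previews = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip anything that is not a csv member
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.reader(text)
                header = next(reader, [])
                # Take at most `rows` data rows after the header.
                body = [row for _, row in zip(range(rows), reader)]
            previews[name] = (header, body)
    return previews
```

Usage would be along the lines of `preview_top_files("top1000.zip")`, where `top1000.zip` stands in for whatever the downloaded daily archive is actually called.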

The counts are collected from our Hadoop system, using a Hive query, with data markup done in UDF scripts. This feed hopefully addresses a long-standing request, expressed often and by many, which we regrettably couldn't fulfil earlier, as our pre-Hadoop infrastructure and processing capacity were not up to the task.

An initial draft design (RFC) was presented last November at the Amsterdam Hackathon 2014 (GLAM and Wikidata).

Online consultation followed, leading to the current design [4].

This is a data feed with production status, but not the final release, as there is one major issue that hasn't been addressed yet (but progress is being made):

When using Media viewer to view images, some images are prefetched for better user experience, but these may never be shown to the user. Currently, those prefetched images are getting counted, as there is no way to detect whether an image was actually shown to the user or not.

Gilles Dubuc and other colleagues worked on a solution that would not hamper performance (a tough challenge) and would help us discern viewed from non-viewed files. A few days ago a patch was published! Adaptation of the Hive query will follow later. [6] Also, and related, context tagging isn't supported yet. [9]

Huge thanks to all people who contributed to the process so far, and still do.

Special thanks to Christian Aistleitner with whom I co-authored the design, and who also wrote the Hive implementation.

Erik Zachte

[1] http://dumps.wikimedia.org/other/mediacounts/

[2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Filtering

[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mediacounts

[4] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts

[5] https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites (a new version of this data feed is in the works)

[6] https://phabricator.wikimedia.org/T89088

[7] Before you ask: no plans yet for further aggregation into monthly or yearly top ranking files. The current csv files are quick wins, using standard Linux tools.

[8] https://www.mediawiki.org/wiki/Multimedia/Media_Viewer

[9] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#by_context

Hay (Husky)

24 Mar

3:42 p.m.

Awesome! I'm especially glad that more statistics than 'just' the image views are included, like the aggregated views for thumbnails, and the media files as well. I just hope somebody will build a tool in the near future like stats.grok.se so we can view statistics for individual files and/or sets of files a la BaGLAMa 2.

-- Hay

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Jane Darnell

3:47 p.m.

+1 - I just crashed my spreadsheet trying to open one .tsv file. But great news indeed Erik - this is an important first step!

Maarten Brinkerink

25 Mar

3:05 a.m.

Dear Erik,

Thanks for pointing to this nice development! Since I’m not so technical, I was wondering to what extent this development helps us reach the vision and requirements that have been described by Maarten Zeinstra as part of his research for the GW Toolset project (https://commons.wikimedia.org/wiki/Commons:GLAMwiki_Toolset_Project)?

See: https://commons.wikimedia.org/wiki/File:Report_on_requirements_for_usage_and_reuse_statistics_for_GLAM_content.pdf

Best,

Maarten

Hay (Husky)

6:03 a.m.

Answering my own question: until somebody puts up a stats.grok.se-like interface for the mediacounts, I've hacked together a Python script that can be used to 'query' the TSV files for a single file or a list of files:

https://github.com/hay/wiki-tools/blob/master/etc/mediacounts-stats.py
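To give an idea of what such a lookup involves without opening a whole daily file in a spreadsheet, here is a minimal sketch. It is not the linked script. It assumes the file is tab-separated with the file name in the first column and may be bz2-compressed; the function names are hypothetical, and the authoritative layout is the one documented on wikitech.

```python
import bz2


def open_counts(path):
    """Open a counts file as text, transparently handling .bz2 compression."""
    if str(path).endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def lookup(lines, wanted):
    """Yield (name, other_columns) for rows whose first column is in `wanted`.

    `lines` is any iterable of text lines, so the file is streamed one row
    at a time instead of being loaded into memory.
    """
    wanted = set(wanted)
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if columns[0] in wanted:
            yield columns[0], columns[1:]
```

A call would look like `with open_counts(path) as handle: hits = list(lookup(handle, names))`, where `path` and `names` are supplied by the caller.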

-- Hay

Erik Zachte

10:08 a.m.

Awesome, Hay, thanks!

Federico Leva (Nemo)

11:11 a.m.

And I sent a small silly patch to give a category name like https://commons.wikimedia.org/wiki/Category:Media_from_BEIC as input. Example output attached for the lazy. Some data I found particularly interesting:

1) the sum of columns 11–14 (big thumbs),
2) the ratio between (1) and column 3 (total transfers),
3) column 24 (no Wikimedia referrer).

Total transfers in this small sample seem even higher than pageviews. (1) counts thumbs above 400 pixels, which are usually not embedded by default; (2) should tell how many users probably clicked or did something else; (3) may indicate which files "went viral".
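The three figures can be computed per row roughly as follows. This is a sketch under assumptions: column numbers are taken from this message and read as 1-based, the row has already been split on tabs, the count fields hold plain integers, and the function name is hypothetical.

```python
def nemo_metrics(columns):
    """Return (big_thumbs, ratio, no_wikimedia_referrer) for one row.

    `columns` is the row as a list of strings. Column numbers are 1-based,
    so column n is columns[n - 1].
    """
    def col(n):
        return int(columns[n - 1])

    big_thumbs = sum(col(n) for n in range(11, 15))  # (1) columns 11-14
    total = col(3)                                   # total transfers
    ratio = big_thumbs / total if total else 0.0     # (2) guard against zero
    no_wikimedia_referrer = col(24)                  # (3)
    return big_thumbs, ratio, no_wikimedia_referrer
```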

Nemo

Hay (Husky)

30 Mar

4:38 a.m.

For those interested: I've merged Nemo's patch, so anyone who wants to run queries for a category can now use the script without needing an additional list of files.

https://github.com/hay/wiki-tools/blob/master/etc/mediacounts-stats.py

-- Hay

Erik Zachte

25 Mar

10:07 a.m.

Dear Maarten,

Thanks for sending me this report. I had not seen it yet. No-one mentioned it in our RFC phase, possibly assuming it was common knowledge, which it was not for me. It provides good extra context for where these files are relevant for GLAM.

With these new data files some points on Maarten Zeinstra's substantial list of functional requirements can be addressed.

7.1.4 "Requests counts on object level": done! (BTW most desired aggregation level)

Frequency is daily. As said, no plans for monthly aggregation yet (which is most requested, according to the survey), but that can be done in post-processing. Hourly stats are not possible (few requests for this anyway).
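As an illustration of such post-processing, a minimal Python sketch that sums one count column per file over the rows of several daily files. It assumes rows already split on tabs, with the file name in the first column; the function name and the default column number are placeholders, so pick the column you need from the layout on wikitech [3].

```python
from collections import defaultdict


def aggregate(daily_rows, count_column=3):
    """Sum one count column per file over any number of daily rows.

    `daily_rows` is an iterable of rows (lists of strings) taken from one
    or more daily files; `count_column` is 1-based.
    """
    totals = defaultdict(int)
    for row in daily_rows:
        totals[row[0]] += int(row[count_column - 1])
    return dict(totals)
```

Feeding it the rows of all daily files for one month gives monthly totals per file; files absent on some days simply contribute nothing for those days.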

7.2.1 "Fileview": done!

From the report: "The total number of impressions is the simplest statistic that can be run by the analytics team."

That may be true, but a quick look at our RFC document [4] (and its talk page, and long threads at phabricator) will show you even this wasn't easy at all, and isn't even complete yet (our ambition to make these new files relevant in different contexts, and thus contain counts broken down into many columns, of course contributed to the overall size of the task).

7.2.3 "FilePlay": done!

These files also address:

8.3.1. "Kraken needs to be able to canonicalise different versions (thumbs, stills, etc.) of files into one identity (File)"

8.4 "Kraken needs to be able to remain online even when it needs to do millions of comparisons on every hit that the Wikimedia Commons gets."
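To make "canonicalise" concrete, here is a small Python sketch that folds a thumbnail path into the path of its original file. It only illustrates the idea and is not how the Hive implementation does it. The path shape in the comment is an assumption, other variants (such as transcoded or archived versions) are not handled, and the names are hypothetical.

```python
import re

# Assumed shape of a thumbnail path:
#   /<project>/<wiki>/thumb/<h>/<hh>/<File.ext>/<size>px-<File.ext>
THUMB = re.compile(
    r"^(?P<prefix>/[^/]+/[^/]+)/thumb/"
    r"(?P<rest>[0-9a-f]/[0-9a-f]{2}/[^/]+)/[^/]+$"
)


def canonical_path(path):
    """Map a thumbnail path to its original file's path; leave others as is."""
    match = THUMB.match(path)
    if match:
        return f"{match['prefix']}/{match['rest']}"
    return path
```

With this, requests for any thumbnail size of a file and for the original itself end up under one key, which is what allows a single row per file.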

These files provide partial input for:

8.1.3 "Give me a monthly overview of the views my media files had (FileViews)"

8.1.5 "Give me a monthly overview of the number of plays my audio/video had (FilePlays)"

------------

Especially 7.1.3 and 7.1.5 list other requirements, which are mostly out of scope for these data files.

So these files are not about where media files are embedded; aggregation of media files, e.g. by category, is not covered either. As the report already mentions (in 6. Current status), some of that is covered by existing data files and existing tools, even though some of those data and tools need renovation.

Summing up, these files comprise a basic building block and will hopefully be complemented with other data files in the future. I say hopefully, as I don't know about plans or commitments in this area.

Cheers, Erik

From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Maarten Brinkerink Sent: Wednesday, March 25, 2015 8:06 To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: Re: [Analytics] [Announce] New daily feed: media file request counts

Dear Erik,

See: https://commons.wikimedia.org/wiki/File:Report_on_requirements_for_usage_and...

Best,

Maarten

On 24 Mar 2015, at 20:47, Jane Darnell jane023@gmail.com wrote:

+1 - I just crashed my spreadsheet trying to open one .tsv file. But great news indeed Erik - this is an important first step!

On Tue, Mar 24, 2015 at 8:42 PM, Hay (Husky) huskyr@gmail.com wrote:

Awesome! I'm especially glad that more statistics than 'just' the image views are included, like the aggregated views for thumbnails, and the media files as well. I just hope somebody will build a tool in the near future like stats.grok.se so we can view statistics for individual files and/or sets of files a la BaGLAMa 2.

-- Hay

On Tue, Mar 24, 2015 at 6:39 PM, Erik Zachte ezachte@wikimedia.org wrote:

...

Today WMF Analytics announces a new product: a daily feed of media file request counts for all Wikimedia projects [1].

The counts are based on unsampled data, so any single request within the defined scope [2] will contribute to the counts.

It can be seen as complementary to our page view counts files [5].

The file layout is documented on wikitech [3].

Daily counts have been backfilled from January 1, 2015 onwards.

Additionally there is a daily zip file which contains a small subset of these raw counts: the top 1000 most requested media files, one csv file for each column [7]. As these csv files have headers (not so easy to add in Hive) you may want to start with this file for a first impression (best opened in a spreadsheet program).
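For the full daily files, which can be too large for a spreadsheet, a streaming pass is one alternative. The Python sketch below is only an illustration: it assumes tab-separated lines without a header, with the file name in the first column and a request count in the third. Those column positions are hypothetical defaults, so check the layout on wikitech [3] and adjust name_col and count_col. The function name top_n is also made up for this example.

```python
import csv
import heapq
import io


def top_n(lines, n=10, name_col=0, count_col=2):
    """Return the n (name, count) pairs with the highest counts.

    `lines` is any iterable of tab-separated text lines (for example an open
    file), so the input is read one row at a time and never loaded as a whole.
    The column indexes are assumptions, not the documented layout.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    needed = max(name_col, count_col)
    rows = (
        (int(row[count_col]), row[name_col])
        for row in reader
        if len(row) > needed  # skip short or empty lines
    )
    return [(name, count) for count, name in heapq.nlargest(n, rows)]


if __name__ == "__main__":
    sample = io.StringIO(
        "/a.jpg\t100\t5\n"
        "/b.jpg\t200\t9\n"
        "/c.jpg\t50\t7\n"
    )
    print(top_n(sample, n=2))
```

heapq.nlargest keeps only n rows in memory at a time, which is why this approach copes with files that a spreadsheet cannot open.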

The counts are collected from our Hadoop system, using a Hive query, with data markup done in UDF scripts. This feed hopefully addresses a long standing request, expressed often and by many, which we regrettably couldn't fulfil earlier, as our pre-Hadoop infrastructure and processing capacity were not up to the task.

An initial draft design (RFC) was presented last November at the Amsterdam Hackathon 2014 (GLAM and Wikidata).

Online consultation followed, leading to the current design [4].

This is a data feed with production status, but not the final release, as there is one major issue that hasn't been addressed yet (but progress is being made):

When using Media viewer to view images, some images are prefetched for better user experience, but these may never be shown to the user. Currently, those prefetched images are getting counted, as there is no way to detect whether an image was actually shown to the user or not.

Gilles Dubuc and other colleagues worked on a solution that would not hamper performance (a tough challenge) and would help us discern viewed from non-viewed files. A few days ago a patch was published! Adaptation of the Hive query will follow later. [6] Also, and related, context tagging isn't supported yet. [9]

Huge thanks to all people who contributed to the process so far, and still do.

Special thanks to Christian Aistleitner with whom I co-authored the design, and who also wrote the Hive implementation.

Erik Zachte

[1] http://dumps.wikimedia.org/other/mediacounts/

[2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_count...

[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mediacounts

[4] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_count...

[5] https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites (a new version of this data feed is in the works)

[6] https://phabricator.wikimedia.org/T89088

[7] Before you ask: no plans yet for further aggregation into monthly or yearly top ranking files. The current csv files are quick wins, using standard Linux tools.

[8] https://www.mediawiki.org/wiki/Multimedia/Media_Viewer

[9] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_count...

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics


participants (5)

Erik Zachte
Federico Leva (Nemo)
Hay (Husky)
Jane Darnell
Maarten Brinkerink