File metadata cleanup drive: We now have numbers for Commons


Guillaume Paumier

11 Dec 2014

11:16 a.m.

Greetings,

As many of you are aware, we're currently in the process of collectively adding machine-readable metadata to many files and templates that don't have them, both on Commons and on all other Wikimedia wikis with local uploads [1,2]. This makes it much easier to see and re-use multimedia files consistently with best practices for attribution across a variety of channels (offline, PDF exports, mobile platforms, MediaViewer, WikiWand, etc.)

In October, I created a dashboard to track how many files were missing the machine-readable markers on each wiki [3]. Unfortunately, due to the size of Commons, I needed to find another way to count them there.

Yesterday, I finished implementing the script for Commons and started running it. As of today, we have accurate numbers for the quantity of files missing machine-readable metadata on Commons: ~533,000, out of ~24 million [4]. It may seem like a lot, but I personally think it's a great testament to the dedication of the Commons community.

Now that we have numbers, we can work on going through those files and fixing them. Many of them are missing the {{information}} template, but many of those are also part of a batch: either they were uploaded by the same user, or they were mass-uploaded by a bot. In either case, this makes it easier to parse the information and add the {{information}} template automatically with a bot, thus avoiding painful manual work.
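As an illustration of the batch idea above, here is a minimal sketch that groups files by uploader so that large batches stand out. It assumes the file list has already been turned into (title, uploader) pairs; that input format, the function name `group_by_uploader`, and the `min_size` parameter are hypothetical and are not part of the dashboard or of any existing tool.

```python
from collections import defaultdict


def group_by_uploader(records, min_size=2):
    """Group file titles by uploader, keeping groups of at least min_size files.

    records: iterable of (title, uploader) pairs (an assumed input format).
    Returns a list of (uploader, sorted titles), largest group first.
    """
    groups = defaultdict(list)
    for title, uploader in records:
        groups[uploader].append(title)

    # Keep only uploaders with enough files to look like a batch.
    batches = [(u, sorted(t)) for u, t in groups.items() if len(t) >= min_size]
    # Largest batches first; ties are broken by uploader name for stable output.
    batches.sort(key=lambda item: (-len(item[1]), item[0]))
    return batches
```

Each group returned this way is only a candidate: someone still has to check that its files share a description pattern a bot could parse.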

I invite you to take a look at the list of files at https://tools.wmflabs.org/mrmetadata/commons/commons/index.html and see if you can find such groups and patterns.

Once you identify a pattern, you're encouraged to add a section to the Bot Requests page on Commons, so that a bot owner can fix them: https://commons.wikimedia.org/wiki/Commons:Bots/Work_requests#Adding_the_Inf...

I believe we can make a lot of progress rapidly if we dive into the list of files and fix all the groups we can find. The list and statistics will be updated daily so it'll be easy to see our progress.

Let me know if you'd like to help but are unsure how!

[1] https://meta.wikimedia.org/wiki/File_metadata_cleanup_drive

[2] https://blog.wikimedia.org/2014/11/07/cleaning-up-file-metadata-for-humans-a...

[3] https://tools.wmflabs.org/mrmetadata/

[4] https://tools.wmflabs.org/mrmetadata/commons/commons/index.html

-- Guillaume Paumier


Keegan Peterzell

11 Dec

3:01 p.m.


On Thu, Dec 11, 2014 at 1:16 PM, Guillaume Paumier <gpaumier@wikimedia.org> wrote:

...

Yesterday, I finished implementing the script for Commons and started running it. As of today, we have accurate numbers for the quantity of files missing machine-readable metadata on Commons: ~533,000, out of ~24 million [4]. It may seem like a lot, but I personally think it's a great testament to the dedication of the Commons community.

Wonderful. Thanks!

...

Now that we have numbers, we can work on going through those files and fixing them. Many of them are missing the {{information}} template, but many of those are also part of a batch: either they were uploaded by the same user, or they were mass-uploaded by a bot. In either case, this makes it easier to parse the information and add the {{information}} template automatically with a bot, thus avoiding painful manual work.

I've been poking at all of this with a stick in my free time, and it's true that a good number of these images are part of a set of images and the patterns are readily apparent. Magnus's No Information tool on labs is enormously helpful for retrieving these pattern sets since it's searchable by file name or the user/bot who uploaded the images[1]. I highly recommend it.

...

Once you identify a pattern, you're encouraged to add a section to the Bot Requests page on Commons, so that a bot owner can fix them:

https://commons.wikimedia.org/wiki/Commons:Bots/Work_requests#Adding_the_Inf...

Challenge accepted[2].

1. https://tools.wmflabs.org/add-information/no_information.php?language=common...

2. https://commons.wikimedia.org/w/index.php?title=Commons:Bots/Work_requests&a...

-- Keegan Peterzell Community Liaison, Product Wikimedia Foundation

Tuszynski, Jarek W.

7:44 p.m.

Guillaume,

A few thoughts on file metadata cleanup drive:

1) https://tools.wmflabs.org/mrmetadata/commons/commons/index.html and Category:Files with no machine-readable license (https://commons.wikimedia.org/wiki/Category:Files_with_no_machine-readable_license) show ~78k files without a machine-readable license. A couple of years back we had a big push to make sure that all files on Commons have licenses, and we managed to fix all the files without them (they were mostly files where the license was lost somehow or was not using one of the standard templates). Ever since, we have had a bot checking the database from time to time and adding files without a license to Category:Media without a license: needs history check (https://commons.wikimedia.org/wiki/Category:Media_without_a_license:_needs_history_check). New uploads get the {{no license}} template and have a week to add one, and old uploads, which likely lost a license somehow, are processed manually. There are 29 files there now; all the other files on Commons do have a license or are tagged with {{no license}} or a similar template. So all the files in Category:Files with no machine-readable license (https://commons.wikimedia.org/wiki/Category:Files_with_no_machine-readable_license) need work to be done with licenses, not files. I do not know what machine-readable metadata is needed, but I can help with adding it.

2) Your number of files missing machine-readable metadata on Commons, ~533,000, seems a bit low. According to Special:MostTranscludedPages (https://commons.wikimedia.org/wiki/Special:MostTranscludedPages) there are 24,136,218 files with licenses ({{License template tag}}, https://commons.wikimedia.org/wiki/Template:License_template_tag) and 23,452,741 files with infobox templates ({{Information}} or {{Infobox template tag}}, https://commons.wikimedia.org/wiki/Template:Infobox_template_tag), so I would expect 683,477 files without any infobox templates.

3) As I mentioned on Commons:Bots/Work_requests#An_example_pattern (https://commons.wikimedia.org/wiki/Commons:Bots/Work_requests#An_example_pattern), I would like to first give the original uploaders a chance to fix the files. We can do that by writing a standard message which, without any threat of deletion, asks for help with bringing their files up to current standards. We should have one message per uploader with a list of all the files that need infoboxes. We should also advise them on the use of the VisualFileChange gadget or on requesting specific tasks to be done by bots at Commons:Bots/Work requests. The VisualFileChange gadget by user:Rillke does have an option “Prepend text, notify uploaders” which does almost what I need (one message per uploader), but I would prefer Python code.

4) At some point I started adding such files to [[Category:Media missing infobox template]] (https://commons.wikimedia.org/wiki/Category:Media_missing_infobox_template) for better tracking and started sub-categorizing them into:

a. Files with OTRS

b. Files with {{information}} template which have some parsing issues

c. Files with {{PD-Art}} which should use {{Artwork}} template and where the name of the uploader, upload date, and even source might not be relevant

d. Files using a PD license, like PD-old (except PD-Author or PD-User): for those files, too, the name of the uploader, upload date, and even source might not be relevant

It might be easier to add infoboxes for different groups of files. For example, Magnus' add_information.php tool (http://toolserver.org/%7Emagnus/add_information.php) does not work well for artworks. We also seem to have users that specialize in different subjects, and it might be easier to get their attention with smaller groups of files of one type.
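The "one message per uploader" idea in point 3 could be sketched in Python roughly as follows. This only builds the message text and posts nothing; the (title, uploader) input format, the function name `build_messages`, and the wording of the message are all assumptions made for illustration, not an agreed standard message.

```python
from collections import defaultdict


def build_messages(records):
    """Return {uploader: message text}, one message per uploader.

    records: iterable of (title, uploader) pairs (an assumed input format).
    """
    files_by_uploader = defaultdict(list)
    for title, uploader in records:
        files_by_uploader[uploader].append(title)

    messages = {}
    for uploader, titles in files_by_uploader.items():
        # Placeholder wording; the real text would be agreed on-wiki.
        lines = [
            "Hello, some of your uploads are missing an infobox template "
            "such as {{Information}}.",
            "Could you help bring them up to current standards? "
            "Nothing will be deleted.",
            "",
        ]
        # A leading colon links to the file page instead of embedding the image.
        lines += ["* [[:%s]]" % title for title in sorted(titles)]
        messages[uploader] = "\n".join(lines)
    return messages
```

Keeping message generation separate from posting makes it possible to review the text per uploader before any bot or gadget delivers it.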

Jarek T.

(user:Jarekt)


Guillaume Paumier

12 Dec

12:56 p.m.


Hi,

Thank you for sharing your thoughts, Jarek :)

On Friday, 12 December 2014, 03:44:54, Tuszynski, Jarek W. wrote:

...

So all the files in Category:Files with no machine-readable license (https://commons.wikimedia.org/wiki/Category:Files_with_no_machine-readable_license) need work to be done with licenses, not files. I do not know what machine-readable metadata is needed but I can help with adding them.

Yes, many of those are tricky because there isn't necessarily a "real" license attached to them (example: https://commons.wikimedia.org/wiki/File:%22A_Basket_full_of_Wool%22_(6360159381).jpg ) or the license isn't specific enough.

There are similar discussions at https://meta.wikimedia.org/wiki/Talk:File_metadata_cleanup_drive#How_to_hand... and https://meta.wikimedia.org/wiki/Talk:File_metadata_cleanup_drive#.22Presumed... and the best we might be able to do is to come up with a list of such cases and ask our wonderful lawyers how to handle them :)

...

2) Your number of files missing machine-readable metadata on Commons, ~533,000, seems a bit low. According to Special:MostTranscludedPages (https://commons.wikimedia.org/wiki/Special:MostTranscludedPages) there are 24,136,218 files with licenses ({{License template tag}}, https://commons.wikimedia.org/wiki/Template:License_template_tag) and 23,452,741 files with infobox templates ({{Information}} or {{Infobox template tag}}, https://commons.wikimedia.org/wiki/Template:Infobox_template_tag), so I would expect 683,477 files without any infobox templates.

There are currently ~677,674 files* without any of the following templates:

'Information','Painting', 'Blason-fr-en', 'Blason-fr-en-it', 'Blason-xx', 'COAInformation', 'Artwork', 'Art_Photo','Photograph', 'Book', 'Map', 'Musical_work', 'Specimen'

If this list is incomplete (it probably is) or incorrect, let me know.

*Source: https://tools.wmflabs.org/mrmetadata/commons_list.txt (warning, 18MB text file).

But some of those do have machine-readable metadata picked up by CommonsMetadata even if they don't have an infobox, which brings the number down to ~533,000. It can be that they have templates we're not listing yet, or that they have MR metadata in their EXIF data. Some of the latter are false positives, per https://phabricator.wikimedia.org/T73719
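To make the first count concrete, here is a minimal sketch of a check for "has none of the listed templates", applied to a page's wikitext. It is not the script behind the dashboard: the regular expression, the helper names, and the assumption that the wikitext is available as a string are all illustrative. It also ignores cases such as an explicit `Template:` prefix, redirects to these templates, and metadata found only in EXIF.

```python
import re

# Template names copied from the list above.
INFOBOX_TEMPLATES = [
    'Information', 'Painting', 'Blason-fr-en', 'Blason-fr-en-it', 'Blason-xx',
    'COAInformation', 'Artwork', 'Art_Photo', 'Photograph', 'Book', 'Map',
    'Musical_work', 'Specimen',
]


def _normalize(name):
    """Treat underscores as spaces and ignore the case of the first letter."""
    name = name.strip().replace("_", " ")
    return name[:1].upper() + name[1:]


KNOWN = {_normalize(name) for name in INFOBOX_TEMPLATES}

# Captures the name right after "{{", up to the first "|" or the closing "}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}\n]+?)\s*(?=\||\}\})")


def has_infobox(wikitext):
    """Return True if the wikitext transcludes any of the listed templates."""
    return any(_normalize(m) in KNOWN for m in TEMPLATE_NAME.findall(wikitext))
```

A page for which `has_infobox` returns False would fall into the ~677,674 group under these assumptions; whether it also belongs to the ~533,000 depends on what CommonsMetadata can extract by other means.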

...

3) As I mentioned on Commons:Bots/Work_requests#An_example_pattern (https://commons.wikimedia.org/wiki/Commons:Bots/Work_requests#An_example_pattern), I would like to first give the original uploaders a chance to fix the files. We can do that by writing a standard message, which without any threat of deletion, ask for help with bringing their files up to current standards.

I'm not opposed to this in principle, but I'm not sure I see the value. We're not going to delete files, or change attribution, or anything like that; we're only going to take the existing information and put it into a template so it's easier to access.

My assumption is that most uploaders wouldn't care about such a change in formatting, and that it would entail more work for them to figure out how to do it themselves, than for a few bot owners to do it on a wider scale.

Is this assumption unreasonable?

...

4) At some point I started adding such files to [[Category:Media missing infobox template]] (https://commons.wikimedia.org/wiki/Category:Media_missing_infobox_template) for better tracking and started sub-categorizing them into

...

a. Files with OTRS

b. Files with {{information}} template which have some parsing issues

c. Files with {{PD-Art}} which should use {{Artwork}} template and where the name of the uploader, upload date, and even source might not be relevant

...

d. Files using a PD license, like PD-old (except PD-Author or PD-User): for those files, too, the name of the uploader, upload date, and even source might not be relevant

...

It might be easier to add infoboxes for different groups of files. For example Magnus' add_information.php tool (http://toolserver.org/%7Emagnus/add_information.php) does not work well for artworks. We also seem to have users that specialize in different subjects and it might be easier to get their attention with smaller groups of files of one type.

Thank you for doing this! I think these will be great starting points for specific bot runs :)

-- Guillaume Paumier

Tuszynski, Jarek W.

9:21 p.m.

OK, so some of the Commons license templates are more solid than others, but the file you refer to used the license Template:Flickr-no known copyright restrictions [1] (https://commons.wikimedia.org/wiki/Template:Flickr-no_known_copyright_restrictions). The deletion of the template was discussed to death here [2] (https://commons.wikimedia.org/wiki/Commons:Deletion_requests/Template:Flickr-no_known_copyright_restrictions), but there was no consensus. It would be good to have a list of such templates. A query searching for templates used by files in that directory which transclude {{License template tag}} should do it, but I do not think I can create it with the CatScan3 tool. We do have ~1.5k license templates [3] (https://commons.wikimedia.org/wiki/User:Jarekt/f); that number includes customized templates built from more generic ones. Some of them are very rarely used, so it would be good to look at them again.

About asking uploaders to add infoboxes: this idea comes from two things. One is a desire to get more people involved, since uploaders are often interested in improving their files; the other is a desire to simplify the life of bot writers. I do not think it is possible to write a bot that always gets it right. For example, the file https://commons.wikimedia.org/wiki/File:AJ_3101_ant.jpg just says {{GFDL}} and what is in the image; there is no information about who took the picture, or even whether the uploader thought the subject of the photo was GFDL or the photograph itself. The same goes for https://commons.wikimedia.org/wiki/File:Ajokoirat.png: I do not know if it is GFDL because it was copied from a website claiming GFDL or because the author who uploaded it chose that license. By the way, those files definitely do not meet current standards, but in 2006 they were not unusual. If any of those uploaders are still around, it would be nice if they could clean these files up, because we cannot guess those things.

Jarek T.

(user:Jarekt)

[1] https://commons.wikimedia.org/wiki/Template:Flickr-no_known_copyright_restri...

[2] https://commons.wikimedia.org/wiki/Commons:Deletion_requests/Template:Flickr...

[3] https://commons.wikimedia.org/wiki/User:Jarekt/f


Guillaume Paumier

18 Dec

2:40 p.m.


Hi,

On Thu, Dec 11, 2014 at 8:16 PM, Guillaume Paumier <gpaumier@wikimedia.org> wrote:

...

I invite you to take a look at the list of files at https://tools.wmflabs.org/mrmetadata/commons/commons/index.html and see if you can find such groups and patterns.

Once you identify a pattern, you're encouraged to add a section to the Bot Requests page on Commons, so that a bot owner can fix them: https://commons.wikimedia.org/wiki/Commons:Bots/Work_requests#Adding_the_Inf...

Just a quick note to let you know that the Bot requests page is very active at the moment, with people identifying all sorts of patterns and using bots or the VisualFileChange script to do mass edits.

In just a few days, we've added the information template to more than 23,000 files! [1] Many thanks to Jarekt, Basv, Amir and everyone else who's been fixing all those files :)

Even if you don't own a bot, you can help by looking at the list of files, identifying groups of similar files, and adding a section to the Bot requests page so that the bots can fix the pages automatically.

The Requests page also has links to lists of files grouped by author / upload date if you want to use your favorite spreadsheet application to find groups of files.

[1] Historical tallies: https://tools.wmflabs.org/mrmetadata/commons/commons/historical_tallies.json

-- Guillaume Paumier
