Guillaume,

 

A few thoughts on the file metadata cleanup drive:

 

1)      https://tools.wmflabs.org/mrmetadata/commons/commons/index.html and Category:Files with no machine-readable license show ~78k files without a machine-readable license. A couple of years back we had a big push to make sure that all files on Commons have licenses, and we managed to fix all the files without them (mostly files where the license had been lost somehow or did not use one of the standard templates). Ever since, a bot has been checking the database from time to time and adding files without a license to Category:Media without a license: needs history check. New uploads get the {{no license}} template and the uploader has a week to add a license; old uploads, which likely lost their license somehow, are processed manually. There are 29 files there now; all the other files on Commons either have a license or are tagged with {{no license}} or a similar template. So all the files in Category:Files with no machine-readable license need work done on the licenses, not on the files. I do not know what machine-readable metadata is needed, but I can help with adding it.

2)      Your number of files missing machine-readable metadata on Commons, ~533,000, seems a bit low. According to Special:MostTranscludedPages there are 24,136,218 files with licenses ({{License template tag}}) and 23,452,741 files with infobox templates ({{Information}} or {{Infobox template tag}}), so I would expect 24,136,218 - 23,452,741 = 683,477 files without any infobox template.

3)      As I mentioned on Commons:Bots/Work_requests#An_example_pattern, I would like to first give the original uploaders a chance to fix the files. We can do that by writing a standard message which, without any threat of deletion, asks them for help with bringing their files up to current standards. We should send one message per uploader, with a list of all of that uploader's files that need infoboxes. We should also advise them on the use of the VisualFileChange gadget or on requesting specific tasks to be done by bots at Commons:Bots/Work requests. The VisualFileChange gadget by user:Rillke does have an option “Prepend text, notify uploaders” which does almost what I need (one message per uploader), but I would prefer Python code.

4)      At some point I started adding such files to [[Category:Media missing infobox template]] for better tracking and started sub-categorizing them into:

a.       Files with OTRS

b.      Files with {{information}} template which have some parsing issues

c.       Files with {{PD-Art}} which should use {{Artwork}} template and where the name of the uploader, upload date, and even source might not be relevant

d.      Files using a PD license, like PD-old (except PD-Author or PD-User): for those files, too, the name of the uploader, upload date, and even source might not be relevant

It might be easier to add infoboxes separately for each group of files. For example, Magnus' add_information.php tool does not work well for artworks. We also seem to have users who specialize in different subjects, and it might be easier to get their attention with smaller groups of files of one type.
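
To make point 3 more concrete, here is a rough Python sketch of the "one message per uploader" step. It only groups files by uploader and builds the message text; it does not read from or write to Commons. The function names, the input format (pairs of file title and uploader name) and the message wording are all placeholders of mine, not an agreed template or an existing tool.

```python
from collections import defaultdict

# Placeholder wording (an assumption, not an agreed Commons message).
MESSAGE_HEADER = (
    "== Files missing an infobox template ==\n"
    "Some of your uploads do not yet use {{Information}} or a similar "
    "infobox template. Could you help bring them up to current standards?"
)
MESSAGE_FOOTER = (
    "You may find the VisualFileChange gadget useful, or you can request "
    "a bot task at [[Commons:Bots/Work requests]]. ~~~~"
)


def group_by_uploader(files):
    """Map each uploader to a sorted list of their file titles.

    `files` is an iterable of (title, uploader) pairs. How the pairs are
    obtained (database query, category listing, ...) is left open here.
    """
    grouped = defaultdict(list)
    for title, uploader in files:
        grouped[uploader].append(title)
    return {uploader: sorted(titles) for uploader, titles in grouped.items()}


def build_message(titles):
    """Build the wikitext of a single message listing all given files."""
    lines = [MESSAGE_HEADER]
    # The leading colon links to the file page instead of embedding the image.
    lines += [f"* [[:{title}]]" for title in titles]
    lines.append(MESSAGE_FOOTER)
    return "\n".join(lines)


def build_messages(files):
    """Return {talk page title: message wikitext}, one entry per uploader."""
    return {
        f"User talk:{uploader}": build_message(titles)
        for uploader, titles in group_by_uploader(files).items()
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        ("File:A.jpg", "Alice"),
        ("File:C.jpg", "Bob"),
        ("File:B.jpg", "Alice"),
    ]
    for page, text in build_messages(sample).items():
        print(page)
        print(text)
        print()
```

The same grouping could be run once per sub-category (a-d above), so that each uploader gets advice suited to the type of file. Actually posting the messages would need a bot framework and a bot approval, which I have left out.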

 

Jarek T.

(user:Jarekt)

 

-----Original Message-----
From: commons-l-bounces@lists.wikimedia.org [mailto:commons-l-bounces@lists.wikimedia.org] On Behalf Of Guillaume Paumier
Sent: Thursday, December 11, 2014 2:16 PM
To: Coordination of technology deployments across languages/projects; Wikimedia Commons Discussion List
Subject: [Commons-l] File metadata cleanup drive: We now have numbers for Commons

 

Greetings,

 

As many of you are aware, we're currently in the process of collectively adding machine-readable metadata to many files and templates that don't have them, both on Commons and on all other Wikimedia wikis with local uploads [1,2]. This makes it much easier to see and re-use multimedia files consistently with best practices for attribution across a variety of channels (offline, PDF exports, mobile platforms, MediaViewer, WikiWand, etc.)

 

In October, I created a dashboard to track how many files were missing the machine-readable markers on each wiki [3]. Unfortunately, due to the size of Commons, I needed to find another way to count them there.

 

Yesterday, I finished implementing the script for Commons and started running it. As of today, we have accurate numbers for the quantity of files missing machine-readable metadata on Commons: ~533,000, out of ~24 million [4]. It may seem like a lot, but I personally think it's a great testament to the dedication of the Commons community.

 

Now that we have numbers, we can work on going through those files and fixing them. Many of them are missing the {{information}} template, but many of those are also part of a batch: either they were uploaded by the same user, or they were mass-uploaded by a bot. In either case, this makes it easier to parse the information and add the {{information}} template automatically with a bot, thus avoiding painful manual work.

 

I invite you to take a look at the list of files at https://tools.wmflabs.org/mrmetadata/commons/commons/index.html and see if you can find such groups and patterns.

 

Once you identify a pattern, you're encouraged to add a section to the Bot Requests page on Commons, so that a bot owner can fix them:

https://commons.wikimedia.org/wiki/Commons:Bots/Work_requests#Adding_the_Information_template_to_files_that_don.27t_have_it

 

I believe we can make a lot of progress rapidly if we dive into the list of files and fix all the groups we can find. The list and statistics will be updated daily so it'll be easy to see our progress.

 

Let me know if you'd like to help but are unsure how!

 

[1] https://meta.wikimedia.org/wiki/File_metadata_cleanup_drive

[2] https://blog.wikimedia.org/2014/11/07/cleaning-up-file-metadata-for-humans-and-robots/

[3] https://tools.wmflabs.org/mrmetadata/

[4] https://tools.wmflabs.org/mrmetadata/commons/commons/index.html

 

--

Guillaume Paumier

 
