While any subject can be a copyright violation, I find that images of people are the most frequent offenders, especially those less than 1000px on the longest edge, so a rough filter to that range (if possible) would reduce the volume needing to be processed.
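To give an idea of the pre-filter, the recent-uploads feed could be narrowed with the standard MediaWiki API before anything heavier runs. A rough, untested sketch (the 1000px cut-off is just the heuristic above; batch size and paging would need tuning):

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    # Most recent uploads, with URL and pixel dimensions.
    params = {
        "action": "query",
        "list": "allimages",
        "aisort": "timestamp",
        "aidir": "descending",
        "aiprop": "url|dimensions",
        "ailimit": "50",
        "format": "json",
    }

    resp = requests.get(API, params=params).json()
    # Keep only files whose longest edge is under 1000px.
    suspects = [img for img in resp["query"]["allimages"]
                if max(img["width"], img["height"]) < 1000]
    for img in suspects:
        print(img["name"], img["width"], "x", img["height"])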
On 7 February 2014 17:17, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Samuel Klein, 06/02/2014 23:39:
Are we doing any commons analysis like this at the moment?
Is any similarity-analysis done on upload to help uploaders identify copies of the same image that already exist online? Or to flag potential copyvios for reviewers?
I'm sure TinEye would be glad to give us high-volume API access to enable that sort of cross-referencing.
Would they? It's something we really need, and that we should be doing for all uploads everywhere to save our patrollers a lot of precious time, but it has always looked impossible.
1) If WMF is interested in helping, it would be useful to know. Even getting access to the existing search API key is a quest no hero is known to have successfully completed despite repeated attempts: <https://wikitech.wikimedia.org/wiki/Web_search>. If it's possible to avoid institutional bottlenecks completely, that would also be useful to know.
2) We don't even know what percentage of Wikimedia Commons images is included in TinEye, or at what speed they get indexed. Can someone extract this information from them?
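For the cross-referencing itself, the per-file check could be as small as one HTTP call per upload. A minimal sketch, assuming TinEye's commercial REST API exposes a search endpoint taking an image URL and an API key (the exact endpoint, parameter names and auth scheme would have to be confirmed with whatever access they grant us):

    import requests

    TINEYE_ENDPOINT = "https://api.tineye.com/rest/search/"  # assumed endpoint
    API_KEY = "..."  # whatever key TinEye would provide

    def tineye_match_count(image_url):
        """Return how many external matches TinEye reports for a file URL."""
        resp = requests.post(
            TINEYE_ENDPOINT,
            data={"image_url": image_url},
            headers={"x-api-key": API_KEY},  # assumed auth scheme
        )
        resp.raise_for_status()
        return len(resp.json().get("results", {}).get("matches", []))

    # e.g. run tineye_match_count(img["url"]) for each file kept by the
    # size pre-filter, and flag anything with matches for human review.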
As Fae says, a good part of the work is integrating the results into the patrollers' (and uploaders'?) workflow in a sensible way. Embedding it in UploadWizard may be too much, but a "simple" bot which just places a tag on suspicious images could be made into an extension too, if that's preferred to a mere pywikibot script. If the two premises above are positive, it should be included in <https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Wikimedia_Commons_.2F_multimedia>: GSoC is approaching!
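The pywikibot side is indeed small; an untested sketch of the tagging step (the template name is a placeholder, and find_suspects() stands in for whatever TinEye-based check we settle on):

    import pywikibot

    site = pywikibot.Site("commons", "commons")
    # Placeholder template; the real tag or tracking category is for Commons to decide.
    TAG = "{{Possible copyvio|reason=external matches found by reverse image search}}\n"

    def tag_file(title):
        """Prepend a review tag to a file page, e.g. title='File:Example.jpg'."""
        page = pywikibot.FilePage(site, title)
        if "Possible copyvio" in page.text:
            return  # already tagged
        page.text = TAG + page.text
        page.save(summary="Bot: tagging possible copyvio for human review")

    # for title in find_suspects():  # hypothetical feed from the TinEye check
    #     tag_file(title)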
Nemo