John Resig has just published some excellent data analysis combining TinEye, image archives, and image clustering and deduplication to identify identical and similar images across a large corpus.
http://ejohn.org/research/computer-vision-photo-archives/
Are we doing any commons analysis like this at the moment? Is any similarity-analysis done on upload to help uploaders identify copies of the same image that already exist online? Or to flag potential copyvios for reviewers?
I'm sure TinEye would be glad to give us high-volume API access to enable that sort of cross-referencing.
SJ
On 6 Feb 2014 22:40, "Samuel Klein" meta.sj@gmail.com wrote: ...
Are we doing any commons analysis like this at the moment? Is any similarity-analysis done on upload to help uploaders identify copies of the same image that already exist online? Or to flag potential copyvios for reviewers?
Yes O:-) Check out Faebot's work with Tineye here: https://commons.m.wikimedia.org/wiki/User:Faebot/SandboxM
That's just beautiful. Thank you, Fae & Faebot.
I see that job filtered for mobile uploads without EXIF data. What obstacles do you envision for running such a service for all images?
On Thu, Feb 6, 2014 at 7:59 PM, Fæ faewik@gmail.com wrote:
Yes O:-) Check out Faebot's work with Tineye here: https://commons.m.wikimedia.org/wiki/User:Faebot/SandboxM
On 7 February 2014 04:04, Samuel Klein meta.sj@gmail.com wrote:
That's just beautiful. Thank you, Fae & Faebot.
I see that job filtered for mobile uploads without EXIF data. What obstacles do you envision for running such a service for all images?
Technically, it could probably run in near real-time for a subset of recently uploaded images. For a focus on finding copyright problems, results would be more meaningful if a white-list/pre-filter were in place to ignore uploads from reliable sources, from well-established user accounts, or where the EXIF data or applied templates make a problem file highly unlikely (for example, templates showing the upload was part of a recognized wiki-project like WLM, which has its own review process). From my experience with the mobile upload categories, I would expect a "file duplicate/possible copyvio to check" tag or report to be more than 90% successful at identifying files that will get deleted as policy violations or as unnecessary inferior duplicates/crops. With a little more wizardry, it should be possible to "red-flag" some of the files as TV screenshots, as similar to previously deleted images, or even as close matches to black-listed files (such as accepted DMCA take-downs or known spam files).
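To make that concrete, here is a minimal sketch of such a pre-filter in Python. The inputs (template set, EXIF dict, uploader edit count) are assumed to have been fetched from the MediaWiki API; the template names and the 500-edit threshold are illustrative assumptions, not anything Faebot actually uses:

    # Whitelist/pre-filter sketch: decide whether a new upload is worth
    # sending to an image-matching service. Inputs are assumed to come
    # from the MediaWiki API; names and thresholds are illustrative.

    # Templates suggesting the file already has its own review process
    # (e.g. Wiki Loves Monuments) -- hypothetical names.
    TRUSTED_TEMPLATES = {"Wiki Loves Monuments 2013", "GLAM batch upload"}

    ESTABLISHED_EDIT_COUNT = 500  # assumed cut-off for "well established"

    def worth_checking(templates: set, exif: dict, uploader_edits: int) -> bool:
        """Return True if the upload should be cross-referenced with Tineye."""
        if templates & TRUSTED_TEMPLATES:
            return False  # covered by a recognized project's own review
        if uploader_edits >= ESTABLISHED_EDIT_COUNT:
            return False  # established account, low copyvio risk
        # Camera-original EXIF (make + original timestamp) makes a web
        # copyvio less likely; EXIF-stripped files are the risky ones.
        if exif.get("Make") and exif.get("DateTimeOriginal"):
            return False
        return True

    # Example: a fresh account's upload with stripped EXIF gets checked.
    print(worth_checking(set(), {}, uploader_edits=3))  # True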
Other obstacles are less technical:
1. Faebot works without using the Tineye API, as the API is quite restrictive in the number of queries allowed. Many thousands of queries a day would require special permission from Tineye, since even their "commercial" access appears too limited for the volume we might expect.
2. In reality, very few volunteers use Ogre's uploads-from-new-accounts report, and I have had almost no spontaneous feedback on my mobile uploads report. To make the output appealing, it may be better either to build a special dashboard or to use bot-placed "likely copyright issue" tags at the time of upload, so that the flag gets picked up by new-page patrollers in their reports and tools (a rough sketch of such a tagging bot follows this list).
3. Volunteer time, and making this a priority -- I have an interesting backlog of content creation, geo-location and potential GLAM projects, which are more glamorous and fun than fiddling with image-matching and copyright checking. Making a Tineye-based 'similarityBot' work well would probably take non-trivial research, testing, development time and code review, community consultation, report-writing, maintenance and bug-fixing, so this might be a candidate for a grant proposal with an element of paid dev time. I previously thought I might get a proposal together over the summer, along with more reading up on the Tineye API and possibly a bit more testing, but my thoughts on this are tentative right now.
4. Many of the highest-match-count (100+) images in Tineye are obviously public domain, such as photographs of well-known 19th-century paintings, while probably 50%+ of obvious copyright violations have just 3 or fewer matches on Tineye. Pulling the Tineye results in a more intelligent way is possible: for example, Tineye can tell you if another version of the image is on a Wikimedia project (with a licence that probably applies to the uploaded image), or if it is hosted by a source whose licence we recognize and can check, such as a higher-resolution, All Rights Reserved copy on Flickr (a sketch of this kind of result-bucketing follows below). Building a more intelligent bot is possible, but comes with an increasing maintenance headache, as external websites continually change, including any APIs we might connect to and Tineye itself.
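On point 2, bot-placed tagging could be as simple as the pywikibot sketch below. The {{Likely copyvio}} template name is hypothetical (whatever maintenance tag the community agreed on would go there), and any real bot would of course need approval and throttling:

    # Sketch of bot-placed tagging at upload time, using pywikibot.
    # The template name below is hypothetical, not an existing Commons tag.
    import pywikibot

    def flag_for_review(filename, match_url):
        """Prepend a review tag so the file shows up in patrollers' tools."""
        site = pywikibot.Site("commons", "commons")
        page = pywikibot.FilePage(site, filename)
        if "{{Likely copyvio" in page.text:
            return  # already flagged; avoid double-tagging
        page.text = ("{{Likely copyvio|match=%s}}\n" % match_url) + page.text
        page.save(summary="Bot: flagging possible duplicate/copyvio for review")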
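And on point 4, "pulling the Tineye results in a more intelligent way" could start with bucketing matches by host rather than just counting them. The input format here (a flat list of backlink URLs) is an assumption about what the Tineye API returns:

    # Sketch: classify match URLs by host so a reviewer sees *where* copies
    # live, not just how many there are. The list-of-URLs input format is
    # an assumption about the Tineye API's response.
    from urllib.parse import urlparse

    WIKIMEDIA_HOSTS = ("wikimedia.org", "wikipedia.org")

    def classify_matches(match_urls):
        buckets = {"wikimedia": [], "flickr": [], "other": []}
        for url in match_urls:
            host = urlparse(url).netloc.lower()
            if host.endswith(WIKIMEDIA_HOSTS):
                buckets["wikimedia"].append(url)  # licence may already apply
            elif host.endswith("flickr.com"):
                buckets["flickr"].append(url)     # licence checkable via Flickr
            else:
                buckets["other"].append(url)
        return buckets

    # Example: one Commons copy plus an All Rights Reserved Flickr copy.
    print(classify_matches([
        "https://commons.wikimedia.org/wiki/File:Example.jpg",
        "https://www.flickr.com/photos/12345/678910/",
    ]))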
Fae