Re: [Commons-l] Hashing Wikimedia Commons

4 Sep 2014

Hi Jonas,

Awesome project!

I’m cc-ing the WMF Multimedia team, who might have some more answers :)

2014-09-04 12:26 GMT+02:00 Jonas Öberg <jonas(a)commonsmachinery.se>:

...
  Dear all,

 Some of you may have been at our presentation during Wikimania and will
 find this familiar. For the rest of you: I'm working with Commons
 Machinery on software that we hope will identify images on the web, even
 when they are used outside their original context, and provide automatic
 attribution and a referral back to their origin. Imagine a blogger using a
 photo from Commons: you visit that blog, and a browser plugin overlays
 a small icon showing that the image is from Commons and inviting you to
 find out more, even if the blogger forgot to attribute.

 We're currently working on an addon for Firefox to do just this, and we've
 previously worked out a backend to store the information we need to make
 these matches, some utilities for perceptual image hashing etc. We would
 love to work with images from Wikimedia Commons as a first dataset to
 explore how this will all work in practice.

 But in order to do so, we need information from Commons, and we want to
 make this as easy on the WMF servers as possible, so we'd appreciate some
 help and pointers. What we're looking at retrieving is information about
 (1) title, (2) author, (3) license, and (4) thumbnails of medium size.

 The first three we can get from pretty much any of the APIs, or extract
 directly from a dump file. The thumbnails are eluding us though, for two
 reasons. One is that a file, like 30C3_Commons_Machinery_2.jpg, is actually
 in the /b/ba/ directory, but where this /b/ba/ comes from (a hash?) is
 unclear to us, and it's not something we find in the dumps, though we
 can get it from one of the APIs.
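
 On the /b/ba/ question: as far as I know, MediaWiki's hashed upload
 layout derives these directories from the MD5 hex digest of the file
 name (with spaces written as underscores), using the first character and
 then the first two characters. A minimal sketch under that assumption;
 `hashed_dir` is a hypothetical helper name, not part of any API, and the
 result can be checked against the /b/ba/ path mentioned above:

```python
import hashlib


def hashed_dir(filename: str) -> str:
    """Return the 'x/xy' directory prefix for a file name.

    Assumption: the prefix is taken from the MD5 hex digest of the
    underscore-normalised name (first hex char, then first two).
    The name is assumed to already have its first letter capitalised,
    as stored file names on Commons do.
    """
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}"


print(hashed_dir("30C3_Commons_Machinery_2.jpg"))
```

 Since this is computed from the name alone, it could be derived for every
 entry in a dump without any request to the servers.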

 The other is thumbnail sizes. We need to retrieve a reasonably sized image
 (in many cases smaller than the original), about 640px wide, so
 that we can then run a perceptual hash algorithm on this file.

 From what we can understand, you can request a thumbnail of any size for
 an image simply by prefixing the file name with the width you want (like
 123px-Filename.jpg). But it seems really silly to always request 640px,
 for instance, since the WMF servers would need to generate that size
 for us specifically if it doesn't already exist.
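
 For illustration, a thumbnail URL can be assembled without asking the
 servers anything. This is only a sketch: it assumes the width prefix has
 the form `<width>px-`, that thumbnails live under a `thumb/` directory
 below the upload base, and that the directory prefix comes from the MD5
 of the file name. `UPLOAD_BASE` and `thumb_url` are names made up for
 this example.

```python
import hashlib
from urllib.parse import quote

# Assumed base URL for Commons uploads.
UPLOAD_BASE = "https://upload.wikimedia.org/wikipedia/commons"


def thumb_url(filename: str, width: int) -> str:
    """Build the URL of a width-limited thumbnail for a raster image.

    Formats that are rendered to another type (SVG, TIFF, ...) get a
    different thumbnail name and are not handled here.
    """
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    encoded = quote(name)  # percent-encode non-ASCII characters
    return (
        f"{UPLOAD_BASE}/thumb/{digest[0]}/{digest[:2]}/"
        f"{encoded}/{width}px-{encoded}"
    )


print(thumb_url("30C3_Commons_Machinery_2.jpg", 640))
```

 Building the URL says nothing about whether that size is already
 rendered, which is exactly the open question.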

 What we'd find much more appealing is to be able to determine, before
 making the call, which sizes already exist and can be retrieved without
 the WMF servers needing to rescale them for us. And while the viewer on
 Commons does seem to offer thumbnails in various sizes, we can't seem to
 get that information from any API.
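
 One related option is the `imageinfo` query of the MediaWiki API, which
 accepts an `iiurlwidth` parameter and returns a thumbnail URL for that
 width. As far as I know it does not report which sizes are already
 rendered, so this is a sketch of the request rather than an answer to
 the question. `imageinfo_query_url` is a hypothetical helper, and no
 request is actually sent here:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def imageinfo_query_url(filename: str, width: int = 640) -> str:
    """Build an API URL asking for the file's URL, size and a thumbnail URL."""
    params = {
        "action": "query",
        "format": "json",
        "titles": f"File:{filename}",
        "prop": "imageinfo",
        "iiprop": "url|size",   # original URL and pixel dimensions
        "iiurlwidth": width,    # also return a thumbnail URL at this width
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(imageinfo_query_url("30C3_Commons_Machinery_2.jpg"))
```

 The `size` property gives the original dimensions, so a client could at
 least skip thumbnail requests for originals already at or below 640px.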

 We could scrape the Commons web pages for this information, but we figured
 that people here might have good ideas for how to approach this with
 minimal impact on the WMF servers :)

 Sincerely,
 Jonas


-- 
Jean-Frédéric
