On 8/25/07, Erik Moeller erik@wikimedia.org wrote:
On 8/26/07, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
What would be really useful is an sha archive of the internet ;P Imagine that we can find the source of an image by just looking it up in the archive.
That actually should be doable for the major image search engines. I'll try to get the idea passed around a bit at least.
If you do end up in one of these conversations, also try to get a feel for how they'd react to generating a lookup key for fuzzy matching as well.
SHA-1 will let us catch bit-identical duplicates, but it fails if someone resizes, crops, recompresses, or strips EXIF, even if the change isn't visible. It would be a good first step, but it is trivial to evade, even accidentally.
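To make the limitation concrete, here is a minimal sketch using Python's standard hashlib. The byte strings are made-up stand-ins for image files, and the names (sha1_hex, original, altered) are only for illustration. It shows that an exact copy keeps the same digest while a single flipped bit, of the kind a metadata edit or recompression would cause many times over, gives an unrelated digest.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as a 40-character hex string."""
    return hashlib.sha1(data).hexdigest()


# Stand-in for the raw bytes of an image file (assumption: any bytes will do).
original = bytes(range(256)) * 4

# A bit-identical copy, e.g. the same file re-uploaded under another name.
exact_copy = bytes(original)

# The same data with one bit flipped in the last byte.
tmp = bytearray(original)
tmp[-1] ^= 0x01
altered = bytes(tmp)

same_digest = sha1_hex(original) == sha1_hex(exact_copy)
different_digest = sha1_hex(original) != sha1_hex(altered)

print("exact copy matches:", same_digest)
print("one-bit change detected as different:", different_digest)
```

An archive keyed on these digests would therefore find the exact copy and miss the altered one entirely.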
I've been working on writing software for doing fuzzy image matching. It has been a low priority project that I've worked on off-and-on for the last few months so it is slow in coming, but I will eventually produce something good or someone else will beat me to it.
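To give an idea of what a fuzzy lookup key can look like, here is a minimal sketch of one simple and widely known technique, the "average hash". This is not the software I mentioned above, just an illustration. It assumes the image has already been reduced to an 8x8 grid of grayscale values (the downscaling step is left out), and the function names are hypothetical.

```python
def average_hash(pixels):
    """Build a bit string (as an int) with one bit per pixel:
    1 if the pixel is brighter than the image mean, else 0."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# Made-up 8x8 grayscale gradient standing in for a downscaled image.
image = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]

# Slightly brightened version: every byte differs, so SHA-1 would not match.
brighter = [[v + 3 for v in row] for row in image]

# Inverted version: a visibly different picture.
inverted = [[252 - v for v in row] for row in image]

near = hamming(average_hash(image), average_hash(brighter))
far = hamming(average_hash(image), average_hash(inverted))
print("distance to brightened copy:", near)
print("distance to inverted image:", far)
```

Matching is then a question of small Hamming distance rather than equality, which is why a fuzzy key needs a different kind of index than an exact hash does.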
It isn't something that we should allow to slow down the introduction of exact match searches, but it would be good to have the contacts ready when we can propose doing something more.
Also related to this subject is the request I sent to the board a while back on contacting copyright violation detection companies. I never heard any response:
---------- Forwarded message ----------
From: Gregory Maxwell gmaxwell@wikimedia.org
Date: Feb 28, 2007 7:34 PM
Subject: Contacting copyright violation detection companies.
To: board-l@lists.wikimedia.org
There are several commercial companies that exist to help copyright holders locate web sites which are infringing their copyrights.
The exact method of operation differs from company to company, but all appear to involve the company running a web spider that goes out and looks for possibly infringing content, and all that I've found sell this as a service to content holders.
For example, one company is Digimarc (http://www.digimarc.com/). With Digimarc's approach, content holders add invisible watermarks to their content, which Digimarc's web spiders detect. Digimarc also offers a no-cost software tool for Windows/Mac which decodes and displays any embedded watermarks.
What I'd like to do is contact one or more of these companies to explore opportunities for us to cooperate for our mutual benefit.
I see a number of benefits and a number of potential risks:
Benefits:
* Reduction in copyright-violating content on our projects.
* Increased speed in detection of copyright violations.
* An independent indicator of the effectiveness of our communities' ability to detect copyright violations.
* An opportunity to make public statements about our efforts, differentiate ourselves from many other web 2.0 services, and highlight our higher goals.
* Increased evidence of due care on our part, which may be useful in future legal disputes.
* Improved efficiency, since some of these services would spider us anyway. Cooperation may yield decreased bandwidth usage, and without our cooperation our method of notice will be DMCA takedown requests.
* Establishing a relationship before a possible change in legal climate switches these companies into a 'charge the service provider' business model.
Risks:
* Incorrect detection: some companies may falsely claim ownership of public domain content.
* The detection company may consider us a potential customer and nag us to purchase services.
* Loss of goodwill from interacting with companies whose purpose can be publicly unpopular. Forcing the takedown of illegally copied videos on YouTube garners enough dislike, but many of these companies also play in the Digital restrictions management space.
It may also be possible that such companies would be interested in a live media feed, possibly as a service we could sell them, or possibly as income we forgo in the spirit of cooperation and mutual benefit. I suspect that we're an unattractive enough target and good enough at policing ourselves that no one would be interested in paying.
I believe that the first risk can be resolved by setting this up to provide input to the community rather than some sort of automatic upload restriction. The second point is harder to address.
What I'm looking for is permission to make contact and see what possibilities exist. I would then report back to the board with my findings.
I'm also interested in any guidance on what we are willing to do, which I could use in my initial discussions. For example, would we be willing to run a non-free, company-provided watermark detection tool to avoid having to send all our uploads off for checking?
I've been on the lookout for researchers interested in developing open source fuzzy image comparison tools for our own checking purposes (for example, to detect uploads of previously deleted content). I think that such tools will be important in the long term, but the proposed cooperation would not be mutually exclusive and would serve a different but related purpose.
(I'm not on the board list, so remember to copy me on replies)