Hello Jonas,
Similarly, I guess that the search complexity of your hash approach is O(1), while in Pastec it is much more involved: first a tf-idf ranking and then two geometrical rerankings...
Close to O(1), at least. How does Pastec scale to many images? You mentioned having about 400,000 currently, which is still a fair number, but what about the full ~22M of Wikimedia Commons? I'm assuming that since tf-idf is a well-known method for text mining, there are well-understood and optimised algorithms to search with it. Perhaps something like Elasticsearch would be useful right away too?
That would be an advantage, since with our blockhash we've had to implement the relevant search algorithms ourselves, lacking existing implementations.
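For context on the O(1) claim: an exact-match hash lookup is just a hash-table probe, so search cost does not grow with the number of indexed images. A minimal sketch (the truncated hex digests and file names below are invented for illustration, not real blockhashes):

```python
# Minimal sketch of O(1) exact-match lookup on perceptual hashes.
# The hex digests and image names below are invented for illustration.

index = {}  # hash -> list of image identifiers (a plain hash table)

def add_image(image_id, hash_hex):
    """Insert an image's hash into the index (amortised O(1))."""
    index.setdefault(hash_hex, []).append(image_id)

def lookup(hash_hex):
    """Exact-duplicate lookup: a single hash-table probe, O(1)."""
    return index.get(hash_hex, [])

add_image("cat.jpg", "00ffc3...")      # truncated hash, illustration only
add_image("cat_copy.jpg", "00ffc3...")
add_image("dog.jpg", "17a2b9...")

print(lookup("00ffc3..."))  # both copies share the hash
```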
The tf-idf method used in Pastec is an adaptation of the algorithm for image ranking, so unfortunately it also seems complicated to reuse implementations designed for text. To return results in real time, the Pastec index must fit into RAM. About 1M images per instance seems possible, but to reach the 22M of Wikimedia Commons, several instances running on several servers would be required. When there are many images on an instance, search times also increase significantly.
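As a rough illustration of why such an index is RAM-hungry and scoring cost grows with the collection: each image contributes many "visual words" to an inverted index, and a query walks the posting list of every word it contains. A toy sketch of tf-idf scoring over visual words, assuming a bag-of-visual-words representation (the word IDs and image contents are invented, not Pastec's actual data structures):

```python
import math
from collections import defaultdict

# Toy tf-idf scoring over "visual words" (quantised local descriptors).
# Word IDs and image contents below are invented for illustration.

postings = defaultdict(set)   # visual word -> set of image ids
images = {
    "img1": [3, 17, 17, 42],
    "img2": [3, 99],
    "img3": [42, 42, 7],
}
for img, words in images.items():
    for w in words:
        postings[w].add(img)

def score(query_words):
    """Sum idf weights over the posting list of each query word."""
    n = len(images)
    scores = defaultdict(float)
    for w in set(query_words):
        docs = postings.get(w, ())
        if not docs:
            continue
        idf = math.log(n / len(docs))  # rarer words weigh more
        for img in docs:
            scores[img] += idf
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score([3, 17]))  # img1 ranks first: it matches both words
```

Note that the whole `postings` structure must be resident for queries to be fast, which is the RAM constraint mentioned above.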
One problem that we see, and which was discussed recently on the commons-l mailing list, is the possibility of using approaches like yours and ours to identify duplicate images on Commons. We've generated a list of 21,274 duplicate pairs, but some of them aren't actually duplicates, just very similar. Most commonly this is map data, like [1] and [2], where only a specific region differs.
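That failure mode is inherent to distance-threshold matching: two maps differing only in one small highlighted region flip only a bit or two of the hash, which stays within the match threshold. A sketch of the Hamming-distance comparison (the 16-bit hashes and threshold below are shrunk from the real case purely for illustration):

```python
def hamming(h1, h2):
    """Number of differing bits between two equal-length hex hashes."""
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")

def near_duplicate(h1, h2, threshold=2):
    return hamming(h1, h2) <= threshold

# Two maps that differ only in one small region flip only a bit or two,
# so the pair is reported as a duplicate even though the images differ.
map_a = "f0f0"   # toy 16-bit hashes, not real blockhashes
map_b = "f0f1"   # one changed region -> one flipped bit
photo = "3a5c"   # an unrelated image

print(near_duplicate(map_a, map_b))  # True  -> flagged as duplicate
print(near_duplicate(map_a, photo))  # False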
I'm hypothesizing that your ORB detection would have better success there, since it would hopefully detect the coloured area as a feature and be able to distinguish the two images from each other.
Unfortunately, ORB won't help you here either. The descriptors are computed only on the luminance plane and are located at edge zones. They aim at retrieving similar images, and in your example the two images are perfect candidates for that.
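To illustrate the luminance point: two regions with different colours can reduce to the same grayscale value, so any descriptor computed on the luminance plane alone cannot tell them apart. A sketch using the ITU-R BT.601 luma formula (the pixel values are chosen for illustration):

```python
def luma(r, g, b):
    """ITU-R BT.601 luma: what a grayscale/luminance plane stores."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)

# A saturated red pixel and a mid-gray pixel: visually very different,
# but identical once reduced to luminance, which is all a
# luminance-plane descriptor ever sees.
red  = (255, 0, 0)
gray = (76, 76, 76)

print(luma(*red), luma(*gray))  # both 76
```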
In general, my feeling is that your work with ORB and our work with blockhashes complement each other nicely. They address different use cases but have the same purpose, so being able to search using both would sometimes be an advantage. What is your strategy for scaling beyond your existing 400,000 images, and is there some way we can cooperate on this? As we go about hashing additional sets (Flickr is a prime candidate), it would be interesting for us to generate both our blockhash and your ORB visual-words signature in an easy way, since we retrieve the images anyway.
Currently, I am not planning to scale much beyond ~1M images, as I do not have the computational resources. I think that your small-hash approach, despite being less robust to image modifications, is much better suited to databases of that size. It would be possible to store and search the index on disk, but that would be very slow and thus not practical.
Best regards,