Hi Adrien!
Using the "visual word" approach I use in Pastec would enable the matching of modified images but would also require a lot more resources. Thus, while your hash is 256 bits long, an image signature in the Pastec index is approximately 8 KB.
8 KB still isn't too bad. It sounds like it could be useful.
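To put the two sizes side by side, here is a rough back-of-the-envelope sketch. It assumes "8 KB" means 8 × 1024 bytes and uses the collection sizes discussed in this thread (about 400,000 images and roughly 22 million); the function name is only illustrative, and the figures cover raw signatures only, not whatever overhead a real index adds.

```python
# Rough storage estimate for raw signatures only (no index overhead).
BLOCKHASH_BYTES = 256 // 8           # a 256-bit hash is 32 bytes
PASTEC_SIGNATURE_BYTES = 8 * 1024    # assumption: "approximately 8 KB" taken as 8 KiB


def index_size_bytes(num_images: int, bytes_per_image: int) -> int:
    """Total bytes needed to store one signature per image."""
    return num_images * bytes_per_image


for n in (400_000, 22_000_000):
    hash_mb = index_size_bytes(n, BLOCKHASH_BYTES) / 1e6
    sig_gb = index_size_bytes(n, PASTEC_SIGNATURE_BYTES) / 1e9
    print(f"{n:>10,} images: hashes ~{hash_mb:.1f} MB, signatures ~{sig_gb:.1f} GB")
```

Under these assumptions the per-image ratio is 256 to 1, which is large but still within reach of a single machine even at the 22 million scale.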
Similarly, I guess that the search complexity of your hash approach is O(1), while in Pastec this is much more complicated: first "tf-idf" ranking and then two geometrical rerankings...
Close to O(1) at least. How does Pastec scale to many images? You mentioned having about 400,000 currently, which is already a fair number, but what about the full ~22M of Wikimedia Commons? I'm assuming that since tf-idf is a well-known method for text mining, there are well-understood and optimised algorithms for searching. Perhaps something like Elasticsearch would be useful right away too?
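For reference, the tf-idf ranking step can be sketched in a few lines if each image is treated as a "document" whose "terms" are visual word IDs. This is only a minimal illustration, not Pastec's actual code: the function names, the integer word IDs, and the cosine scoring are assumptions, and the geometrical reranking steps are left out entirely.

```python
import math
from collections import Counter


def build_idf(docs):
    """docs maps image name -> list of visual word IDs (hypothetical layout)."""
    n = len(docs)
    df = Counter()
    for words in docs.values():
        df.update(set(words))          # count each word once per image
    return {w: math.log(n / c) for w, c in df.items()}


def tfidf_vector(words, idf):
    """Sparse tf-idf vector; words unknown to the index are ignored."""
    tf = Counter(words)
    total = len(words)
    return {w: (c / total) * idf[w] for w, c in tf.items() if w in idf}


def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(query_words, docs, idf):
    """Return (image name, score) pairs, best match first."""
    q = tfidf_vector(query_words, idf)
    scores = [(name, cosine(q, tfidf_vector(words, idf)))
              for name, words in docs.items()]
    return sorted(scores, key=lambda s: -s[1])
```

This version scores every indexed image, so it is linear in the collection size; text search engines typically avoid that with an inverted index from term to documents, which is the part that would presumably carry over to visual words.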
That would be an advantage, since with our blockhash we've had to implement the relevant search algorithms ourselves for lack of existing implementations.
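To illustrate the kind of search involved, here is a minimal sketch of matching 256-bit hashes by Hamming distance. It is not our actual implementation: the function names, the in-memory dictionary, and the 10-bit default threshold are all assumptions made for the example. An exact match is a plain dictionary lookup, which is where the "close to O(1)" comes from; the near-match search below is a linear scan.

```python
from __future__ import annotations


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes stored as integers."""
    return bin(a ^ b).count("1")


def search(index: dict[str, int], query: int,
           max_distance: int = 10) -> list[tuple[str, int]]:
    """Return (name, distance) for every hash within max_distance bits.

    index maps a file name to its hash as an int (hypothetical layout).
    The threshold of 10 bits is an arbitrary illustrative value.
    """
    hits = []
    for name, h in index.items():
        d = hamming_distance(query, h)
        if d <= max_distance:
            hits.append((name, d))
    hits.sort(key=lambda hit: (hit[1], hit[0]))   # closest first, then by name
    return hits
```

A linear scan like this gets slow for tens of millions of hashes, which is why a dedicated index structure is needed in practice.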
One issue that we see, and which was discussed recently on the commons-l mailing list, is the possibility of using approaches like yours and ours to identify duplicate images in Commons. We've generated a list of 21274 duplicate pairs, but some of them aren't actually duplicates, just very similar. Most commonly this is map data, like [1] and [2], where just a specific region differs.
I'm hypothesizing that your ORB detection would have better success there, since it would hopefully detect the colored area as a feature and be able to distinguish the two from each other.
In general, my feeling is that your work with ORB and our work with blockhashes complement each other nicely. They address different use cases but have the same purpose, so being able to search using both would sometimes be an advantage. What is your strategy for scaling beyond your existing 400,000 images, and is there some way we can cooperate on this? As we go about hashing additional sets (Flickr is a prime candidate), it would be interesting for us if we could generate both our blockhash and your ORB visual words signature in an easy way, since we retrieve the images anyway.
[1] https://commons.wikimedia.org/wiki/File:Locator_map_Puerto_Rico_Trujillo_Alt...
[2] https://commons.wikimedia.org/wiki/File:Locator_map_Puerto_Rico_Carolina.png