Hi all,
--- Quoting Magnus Manske magnusmanske@googlemail.com:
Well, cross-checking one million commons images against a few hundred thousand on one of the larger wikipedias might kill the toolserver quite efficiently ;-)
Well, I agree that image processing is a very CPU-intensive task, and cross-checking adds to the difficulty.
However, I think it may be possible to build a kind of hash signature for each file and sort the signatures to find duplicates. The hashing itself would take some time, but it could be split across several servers. The resulting hash lists could then be sorted, so that matching signatures point to pairs of images worth checking further.
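To make that a bit more concrete, here is a rough sketch in Python (the directory layout and wiki names are just made up for the example): hash every file, sort the (digest, wiki, filename) tuples, and any run of identical digests gives a group of duplicate candidates for a real comparison afterwards.

```python
import hashlib
import os
from itertools import groupby

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def duplicate_candidates(image_dirs):
    """image_dirs maps a wiki name to a local directory of its images.
    Yields groups of (digest, wiki, filename) that share the same digest."""
    entries = []
    for wiki, root in image_dirs.items():
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                entries.append((file_digest(os.path.join(dirpath, name)), wiki, name))
    entries.sort()  # identical digests become adjacent after sorting
    for digest, grp in groupby(entries, key=lambda e: e[0]):
        grp = list(grp)
        if len(grp) > 1:
            yield grp

# Hypothetical local mirrors of the image directories:
# for group in duplicate_candidates({"commons": "/mirror/commons/images",
#                                    "frwiki": "/mirror/frwiki/images"}):
#     print(group)
```

The hashing loop is the part that parallelises trivially: each server hashes its own share of the files, and only the small (digest, wiki, name) lists need to be merged and sorted in one place.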
One drawback of this approach is having to maintain a huge index of all the signatures (each one associated with the image name and the originating wiki).
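For a first experiment, that index could be as simple as one table in an SQLite file; the schema below is only an illustration (nothing that exists on the toolserver): one row per (wiki, image) with its digest, plus an index on the digest so cross-wiki matches fall out of a single GROUP BY.

```python
import sqlite3

def open_signature_index(db_path="signatures.db"):
    """Open (creating if needed) a signature index:
    one row per image, storing its digest and originating wiki."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS signatures (
            digest TEXT NOT NULL,
            wiki   TEXT NOT NULL,
            image  TEXT NOT NULL,
            PRIMARY KEY (wiki, image)
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_digest ON signatures (digest)")
    return conn

def cross_wiki_duplicates(conn):
    """Return digests that occur on more than one wiki,
    together with the wiki:image pairs that share them."""
    return conn.execute("""
        SELECT digest, GROUP_CONCAT(wiki || ':' || image)
        FROM signatures
        GROUP BY digest
        HAVING COUNT(DISTINCT wiki) > 1
    """).fetchall()
```

Whether such a table stays manageable with a few million rows is, of course, exactly the drawback mentioned above.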
Or perhaps I'm just writing bullshit :)
Best regards from France,
--
Alexandre.NOUVEL@alnoprods.net
|-> http://www.alnoprods.net
|-> Private copying and self-distribution under threat: http://eucd.info
\ I hate spam. I kill spammers. Honestly.