Fulfilling a request, I added "User dupes" to my set of toys. For a user and a wiki (wikipedia or commons), it can find uploaded files identical in size (pixels and bytes) but with different names.
Magnus
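[The tool's own code is not quoted anywhere in this thread. Purely as an illustration, here is a minimal sketch of how such a size-based check might look against a toolserver replica of a wiki's MediaWiki image table, reading "for a user" as "that user's uploads checked against each other". The host, database name and credentials below are placeholders, not the real userdupes.php setup.]

# Minimal sketch (not the actual userdupes.php code): list pairs of a user's
# uploads that share byte size and pixel dimensions but carry different names.
# Host, database name and credentials file are placeholders.
import os
import MySQLdb

def find_user_dupes(conn, username):
    sql = """
        SELECT a.img_name, b.img_name
        FROM image AS a
        JOIN image AS b
          ON  a.img_size   = b.img_size
          AND a.img_width  = b.img_width
          AND a.img_height = b.img_height
          AND a.img_name   < b.img_name   -- different names, each pair reported once
        WHERE a.img_user_text = %s
          AND b.img_user_text = %s
    """
    cur = conn.cursor()
    cur.execute(sql, (username, username))
    return cur.fetchall()

if __name__ == "__main__":
    # Placeholder connection to a replicated wiki database.
    conn = MySQLdb.connect(host="sql.example.org", db="commonswiki_p",
                           read_default_file=os.path.expanduser("~/.my.cnf"))
    for name_a, name_b in find_user_dupes(conn, "Example User"):
        print("%s <-> %s" % (name_a, name_b))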
Great tool! Is there a way to make it cross-wiki, so as to find commons duplicates on a local wiki that are not under the same name and not from the same user?
Bence
On 12/3/06, Magnus Manske magnusmanske@googlemail.com wrote:
Fulfilling a request, I added "User dupes" to my set of toys. For a user and a wiki (wikipedia or commons), it can find uploaded files identical in size (pixels and bytes) but with different names.
Magnus
[1] http://tools.wikimedia.de/~magnus/userdupes.php
On 12/4/06, Bence Damokos bdamokos@gmail.com wrote:
Great tool!
Thanks!
Is there a way to make it cross-wiki, so as to find commons duplicates on a local wiki that are not under the same name and not from the same user?
Well, cross-checking one million commons images against a few hundred thousand on one of the larger wikipedias might kill the toolserver quite efficiently ;-)
Or did you mean, given a wikipedia and a user, find duplicates on the local wiki and commons alike? That would be possible.
Magnus
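[As an illustration of that second reading — one user, local wiki and commons checked together — here is a sketch along the same lines as the earlier one. With nothing like a shared hash column, sizes and dimensions are fetched from the local wiki and matched against commons one file at a time; both arguments are assumed to be replica connections as in the first sketch.]

# Sketch of the per-user cross-wiki variant: fetch the user's uploads from the
# local wiki, then look for files with the same byte size and dimensions on
# commons. Both arguments are DB connections to the respective replicas.
def find_cross_wiki_dupes(local_conn, commons_conn, username):
    cur = local_conn.cursor()
    cur.execute(
        "SELECT img_name, img_size, img_width, img_height"
        " FROM image WHERE img_user_text = %s",
        (username,),
    )
    matches = []
    for name, size, width, height in cur.fetchall():
        ccur = commons_conn.cursor()
        ccur.execute(
            "SELECT img_name FROM image"
            " WHERE img_size = %s AND img_width = %s AND img_height = %s",
            (size, width, height),
        )
        matches.extend((name, row[0]) for row in ccur.fetchall())
    return matches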
Hi all,
--- Magnus Manske magnusmanske@googlemail.com wrote:
Well, cross-checking one million commons images against a few hundred thousand on one of the larger wikipedias might kill the toolserver quite efficiently ;-)
Well, I agree that image processing is a very CPU-intensive task, and cross-checking adds to the difficulty.
However, I think it may be possible to build a kind of hash signature for each file and sort them to find duplicates. The hashing process itself would require some time but could be split among several servers. The resulting hash lists could then be sorted, so that matching signatures would lead to further checking of the images they were computed from.
One drawback of this solution is the need to maintain a huge index of all the signatures (each one associated with the image name and the originating wiki).
Or perhaps I'm just writing bullshit :)
Best regards from France,
--
Alexandre.NOUVEL@alnoprods.net
|-> http://www.alnoprods.net
|-> Private copying and self-distribution under threat: http://eucd.info
\ I hate spam. I kill spammers. Honestly.
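[To make the hash-signature idea above concrete, a small sketch — an illustration, not an existing tool: hash every file once, bucket names by signature, and any bucket with more than one entry is a candidate for closer inspection. The directory walk is a stand-in for whatever storage the real job would actually read.]

# Illustration of the hash-signature idea: one pass to compute a signature per
# file, then group by signature; collisions are duplicate candidates. The
# directory walk is a stand-in for the real image store.
import hashlib
import os
from collections import defaultdict

def build_signature_index(root):
    index = defaultdict(list)           # md5 hex digest -> list of file paths
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)     # hash in 1 MB chunks to keep memory flat
            index[h.hexdigest()].append(path)
    return index

def duplicate_candidates(index):
    return dict((sig, paths) for sig, paths in index.items() if len(paths) > 1)

[Sorting the resulting hash list, as suggested, and grouping by key come to the same thing; the expensive part is the single pass over the files, which is the piece that could be split among several servers.]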
On 12/4/06, Alexandre NOUVEL alexandre.nouvel@alnoprods.net wrote:
Hi all,
--- Magnus Manske magnusmanske@googlemail.com wrote:
Well, cross-checking one million commons images against a few hundred thousand on one of the larger wikipedias might kill the toolserver quite efficiently ;-)
Well, I agree that image processing is a very CPU-intensive task, and cross-checking adds to the difficulty.
However, I think it may be possible to build a kind of hash signature for each file and sort them to find duplicates. The hashing process itself would require some time but could be split among several servers. The resulting hash lists could then be sorted, so that matching signatures would lead to further checking of the images they were computed from.
There was a discussion somewhere (maybe on this list? I don't remember) about storing MD5 hashes of image data in the table with the other image information (size etc.). Nothing came of it, I'm afraid. Too bad.
One drawback of this solution is the need to maintain a huge index of all the signatures (each one associated with the image name and the originating wiki).
With images being replaced, deleted, undeleted, etc., the only practical place is indeed the image table on the respective wiki. An outside solution (i.e. the toolserver) is out of the question, IMHO.
Or perhaps I'm just writing bullshit :)
Nope :-)
Magnus
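[For the sake of argument, if the image table did carry such a hash column — called img_md5 below, a hypothetical name, not a field that exists — cross-wiki matching would reduce to intersecting two sets of hashes:]

# Hypothetical: assumes an img_md5 column in the image table, which does not
# exist; the point is only to show how cheap the comparison becomes once the
# hashes live next to the other image metadata.
def image_hashes(conn):
    cur = conn.cursor()
    cur.execute("SELECT img_md5, img_name FROM image")
    return dict(cur.fetchall())                     # one name kept per hash

def shared_files(local_conn, commons_conn):
    local = image_hashes(local_conn)
    commons = image_hashes(commons_conn)
    return [(local[md5], commons[md5]) for md5 in set(local) & set(commons)]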
On 12/4/06, Magnus Manske magnusmanske@googlemail.com wrote:
There was a discussion somewhere (maybe on this list? I don't remember) about storing MD5 hashes of image data in the table with the other image information (size etc.). Nothing came of it, I'm afraid. Too bad.
What was the reason for this? Was it technical difficulties, or just a lack of willingness?
On 12/12/06, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
On 12/4/06, Magnus Manske magnusmanske@googlemail.com wrote:
There was a discussion somewhere (maybe on this list? I don't remember) about storing MD5 hashes of image data in the table with the other image information (size etc.). Nothing came of it, I'm afraid. Too bad.
What was the reason for this? Was it technical difficulties, or just a lack of willingness?
I'd say 10% the former (MD5ing all existing images on all wikipedias and commons would mean server stress), and 90% the latter.
Magnus
On Tue, 12 Dec 2006 14:06:41 +0100, Magnus Manske magnusmanske@googlemail.com wrote:
On 12/12/06, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
On 12/4/06, Magnus Manske magnusmanske@googlemail.com wrote:
There was a discussion somewhere (maybe on this list? I don't remember) about storing MD5 hashes of image data in the table with the other image information (size etc.). Nothing came of it, I'm afraid. Too bad.
What was the reason for this? Was it technical difficulties, or just a lack of willingness?
I'd say 10% the former (MD5ing all existing images on all wikipedias and commons would mean server stress), and 90% the latter.
I'm sure the job queue could be tweaked to handle hashing of all existing images in the "background"; it would probably take a few days to complete, but it would hardly need to be a major stress factor. Did anyone file an actual feature request in Bugzilla, or did the discussion just fizzle out before anyone got around to it?
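[In the spirit of that job-queue suggestion, a sketch of what a throttled background pass might look like; fetch_file and record_hash are stand-ins for "read the original file" and "write the hash back to its image row", not existing MediaWiki functions:]

# Sketch of a throttled background hashing pass. fetch_file and record_hash
# are placeholders for reading the original file and storing the hash with the
# image row; the pause between batches keeps this a background task.
import hashlib
import time

def hash_backlog(filenames, fetch_file, record_hash, batch_size=100, pause=5.0):
    for start in range(0, len(filenames), batch_size):
        for name in filenames[start:start + batch_size]:
            data = fetch_file(name)
            record_hash(name, hashlib.md5(data).hexdigest())
        time.sleep(pause)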
That's a pity, since it would have been very handy for automatically comparing images.
Bryan
On 12/12/06, Magnus Manske magnusmanske@googlemail.com wrote:
On 12/12/06, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
On 12/4/06, Magnus Manske magnusmanske@googlemail.com wrote:
There was a discussion somewhere (maybe on this list? I don't remember) about storing MD5 hashes of image data in the table with the other image information (size etc.). Nothing came of it, I'm afraid. Too bad.
What was the reason for this? Was it technical difficulties, or just a lack of willingness?
I'd say 10% the former (MD5ing all existing images on all wikipedias and commons would mean server stress), and 90% the latter.
Magnus