Fulfilling a request, I added "User dupes" to my set of toys. For a user and a wiki (wikipedia or commons), it can find uploaded files identical in size (pixels and bytes) but with different names.
Magnus
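[The tool's own code is not quoted anywhere in this thread. Purely as an illustration, here is a minimal sketch of how such a size-based check might look against a toolserver replica of a wiki's MediaWiki image table, reading "for a user" as "that user's uploads checked against each other". The host, database name and credentials below are placeholders, not the real userdupes.php setup.]

# Minimal sketch (not the actual userdupes.php code): list pairs of a user's
# uploads that share byte size and pixel dimensions but carry different names.
# Host, database name and credentials file are placeholders.
import os
import MySQLdb

def find_user_dupes(conn, username):
    sql = """
        SELECT a.img_name, b.img_name
        FROM image AS a
        JOIN image AS b
          ON  a.img_size   = b.img_size
          AND a.img_width  = b.img_width
          AND a.img_height = b.img_height
          AND a.img_name   < b.img_name   -- different names, each pair reported once
        WHERE a.img_user_text = %s
          AND b.img_user_text = %s
    """
    cur = conn.cursor()
    cur.execute(sql, (username, username))
    return cur.fetchall()

if __name__ == "__main__":
    # Placeholder connection to a replicated wiki database.
    conn = MySQLdb.connect(host="sql.example.org", db="commonswiki_p",
                           read_default_file=os.path.expanduser("~/.my.cnf"))
    for name_a, name_b in find_user_dupes(conn, "Example User"):
        print("%s <-> %s" % (name_a, name_b))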
Great tool! Is there a way to make it cross-wiki, so as to find commons duplicates on a local wiki that are not under the same name and not from the same user?
Bence
On 12/3/06, Magnus Manske magnusmanske@googlemail.com wrote:
Fulfilling a request, I added "User dupes" to my set of toys. For a user and a wiki (wikipedia or commons), it can find uploaded files identical in size (pixels and bytes) but with different names.
Magnus
[1] http://tools.wikimedia.de/~magnus/userdupes.php
On 12/4/06, Bence Damokos bdamokos@gmail.com wrote:
Great tool!
Thanks!
Is there a way to make it cross-wiki, so as to find commons duplicates on a local wiki that are not under the same name and not from the same user?
Well, cross-checking one million commons images against a few hundred thousand on one of the larger wikipedias might kill the toolserver quite efficiently ;-)
Or did you mean, given a wikipedia and a user, find duplicates on the local wiki and commons alike? That would be possible.
Magnus
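[As an illustration of that second reading — one user, local wiki and commons checked together — here is a sketch along the same lines as the earlier one. With nothing like a shared hash column, sizes and dimensions are fetched from the local wiki and matched against commons one file at a time; both arguments are assumed to be replica connections as in the first sketch.]

# Sketch of the per-user cross-wiki variant: fetch the user's uploads from the
# local wiki, then look for files with the same byte size and dimensions on
# commons. Both arguments are DB connections to the respective replicas.
def find_cross_wiki_dupes(local_conn, commons_conn, username):
    cur = local_conn.cursor()
    cur.execute(
        "SELECT img_name, img_size, img_width, img_height"
        " FROM image WHERE img_user_text = %s",
        (username,),
    )
    matches = []
    for name, size, width, height in cur.fetchall():
        ccur = commons_conn.cursor()
        ccur.execute(
            "SELECT img_name FROM image"
            " WHERE img_size = %s AND img_width = %s AND img_height = %s",
            (size, width, height),
        )
        matches.extend((name, row[0]) for row in ccur.fetchall())
    return matches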
Hi all,
--- Magnus Manske magnusmanske@googlemail.com wrote:
Well, cross-checking one million commons images against a few hundred thousand on one of the larger wikipedias might kill the toolserver quite efficiently ;-)
Well, I agree that image processing is a very CPU-intensive task, and cross-checking adds to the difficulty.
However, I think it may be possible to build a kind of hash signature for each file and sort them to find duplicates. The hashing process itself would require some time but could be split among several servers. The resulting hash lists could then be sorted, so that matching signatures would lead to further checking of the images they were computed from.
One drawback of this solution is the need to maintain a huge index of all the signatures (each one associated with the image name and the originating wiki).
Or perhaps I'm just writing bullshit :)
Best regards from France,
--
Alexandre.NOUVEL@alnoprods.net
|-> http://www.alnoprods.net
|-> Private copying and self-distribution under threat: http://eucd.info
\ I hate spam. I kill spammers. Honestly.
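[To make the hash-signature idea above concrete, a small sketch — an illustration, not an existing tool: hash every file once, bucket names by signature, and any bucket with more than one entry is a candidate for closer inspection. The directory walk is a stand-in for whatever storage the real job would actually read.]

# Illustration of the hash-signature idea: one pass to compute a signature per
# file, then group by signature; collisions are duplicate candidates. The
# directory walk is a stand-in for the real image store.
import hashlib
import os
from collections import defaultdict

def build_signature_index(root):
    index = defaultdict(list)           # md5 hex digest -> list of file paths
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)     # hash in 1 MB chunks to keep memory flat
            index[h.hexdigest()].append(path)
    return index

def duplicate_candidates(index):
    return dict((sig, paths) for sig, paths in index.items() if len(paths) > 1)

[Sorting the resulting hash list, as suggested, and grouping by key come to the same thing; the expensive part is the single pass over the files, which is the piece that could be split among several servers.]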
On 12/4/06, Alexandre NOUVEL alexandre.nouvel@alnoprods.net wrote:
Hi all,
--- Magnus Manske magnusmanske@googlemail.com wrote:
Well, cross-checking one million commons images against a few hundred thousand on one of the larger wikipedias might kill the toolserver quite efficiently ;-)
Well, I agree that image processing is a very CPU-intensive task, and cross-checking adds to the difficulty.
However, I think it may be possible to build a kind of hash signature for each file and sort them to find duplicates. The hashing process itself would require some time but could be split among several servers. The resulting hash lists could then be sorted, so that matching signatures would lead to further checking of the images they were computed from.
There was a discussion somewhere (maybe on this list? I don't remember) about storing MD5 hashes of image data in the table with the other image information (size etc.). Nothing came of it, I'm afraid. Too bad.
One drawback of this solution is the need to maintain a huge index of all the signatures (each one associated with the image name and the originating wiki).
With images being replaced, deleted, undeleted, etc., the only practical place is indeed the image table on the respective wiki. An outside solution (i.e. the toolserver) is out of the question, IMHO.
Or perhaps I'm just writing bullshit :)
Nope :-)
Magnus
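[For the sake of argument, if the image table did carry such a hash column — called img_md5 below, a hypothetical name, not a field that exists — cross-wiki matching would reduce to intersecting two sets of hashes:]

# Hypothetical: assumes an img_md5 column in the image table, which does not
# exist; the point is only to show how cheap the comparison becomes once the
# hashes live next to the other image metadata.
def image_hashes(conn):
    cur = conn.cursor()
    cur.execute("SELECT img_md5, img_name FROM image")
    return dict(cur.fetchall())                     # one name kept per hash

def shared_files(local_conn, commons_conn):
    local = image_hashes(local_conn)
    commons = image_hashes(commons_conn)
    return [(local[md5], commons[md5]) for md5 in set(local) & set(commons)]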
On 12/4/06, Magnus Manske magnusmanske@googlemail.com wrote:
There was a discussion somewhere (maybe on this list? I don't remember) about storing MD5 hashes of image data in the table with the other image information (size etc.). Nothing came of it, I'm afraid. Too bad.
What was the reason for this? Was it technical difficulties, or just a lack of willingness?
On 12/12/06, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
On 12/4/06, Magnus Manske magnusmanske@googlemail.com wrote:
There was a discussion somewhere (maybe on this list? I don't remember) about storing MD5 hashes of image data in the table with the other image information (size etc.). Nothing came of it, I'm afraid. Too bad.
What was the reason for this? Was it technical difficulties, or just a lack of willingness?
I'd say 10% the former (MD5ing all existing images on all wikipedias and commons would mean server stress), and 90% the latter.
Magnus
On Tue, 12 Dec 2006 14:06:41 +0100, Magnus Manske magnusmanske@googlemail.com wrote:
On 12/12/06, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
On 12/4/06, Magnus Manske magnusmanske@googlemail.com wrote:
There was a discussion somewhere (maybe on this list? I don't remember) about storing MD5 hashes of image data in the table with the other image information (size etc.). Nothing came of it, I'm afraid. Too bad.
What was the reason for this? Was it technical difficulties, or just a lack of willingness?
I'd say 10% the former (MD5ing all existing images on all wikipedias and commons would mean server stress), and 90% the latter.
I'm sure the job queue could be tweaked to handle hashing of all existing images in the "background"; it would probably take a few days to complete, but it would hardly need to be a major stress factor. Did anyone file an actual feature request in Bugzilla, or did the discussion just fizzle out before anyone got around to it?
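[In the spirit of that job-queue suggestion, a sketch of what a throttled background pass might look like; fetch_file and record_hash are stand-ins for "read the original file" and "write the hash back to its image row", not existing MediaWiki functions:]

# Sketch of a throttled background hashing pass. fetch_file and record_hash
# are placeholders for reading the original file and storing the hash with the
# image row; the pause between batches keeps this a background task.
import hashlib
import time

def hash_backlog(filenames, fetch_file, record_hash, batch_size=100, pause=5.0):
    for start in range(0, len(filenames), batch_size):
        for name in filenames[start:start + batch_size]:
            data = fetch_file(name)
            record_hash(name, hashlib.md5(data).hexdigest())
        time.sleep(pause)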
That's a pity, since it would have been very handy for automatically comparing images.
Bryan
On 12/12/06, Magnus Manske magnusmanske@googlemail.com wrote:
On 12/12/06, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
On 12/4/06, Magnus Manske magnusmanske@googlemail.com wrote:
There was a discussion somewhere (maybe on this list? I don't remember) about storing MD5 hashes of image data in the table with the other image information (size etc.). Nothing came of it, I'm afraid. Too bad.
What was the reason for this? Was it technical difficulties, or just a lack of willingness?
I'd say 10% the former (MD5ing all existing images on all wikipedias and commons would mean server stress), and 90% the latter.
Magnus