I'm computing the URL of an image as follows (the first character of the MD5 hash of the file name, then the first two characters, then the file name):
val md = MessageDigest.getInstance("MD5")
val messageDigest = md.digest(fileName.getBytes)
val md5 = (new BigInteger(1, messageDigest)).toString(16)
val hash1 = md5.substring(0, 1)
val hash2 = md5.substring(0, 2)
val urlPart = hash1 + "/" + hash2 + "/" + fileName
Most of the time the function works correctly, but in a few cases it is incorrect:
For "Stewie_Griffin.png", I get 2/26/Stewie_Griffin.png but the real one is 0/02/Stewie_Griffin.png
The source file info is here:
http://en.wikipedia.org/wiki/File:Stewie_Griffin.png
http://upload.wikimedia.org/wikipedia/en/0/02/Stewie_Griffin.png
Any ideas why the hashing scheme doesn't work sometimes?
I posted this question on Stack Overflow, but I might be able to get a better answer here. http://stackoverflow.com/questions/8389616/does-wikipedia-use-different-meth...
On Mon, Dec 5, 2011 at 5:25 PM, Tommy Chheng tommy.chheng@gmail.com wrote:
Most of the time, the function works correctly but on a few cases, it is incorrect:
For "Stewie_Griffin.png", I get 2/26/Stewie_Griffin.png but the real one is 0/02/Stewie_Griffin.png
Any ideas why the hashing scheme doesn't work sometimes?
Haven't used Java in a while, but make sure you're computing the right string.
brent@brent-desktop:~/pass$ echo -n "Stewie_Griffin.png" | md5sum
026fdc3cd32e81686456d875e668b9f6  -
Thanks, I'll debug this some more. I'm using DBpedia's extraction code.
Mediawiki-api mailing list Mediawiki-api@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
On Mon, Dec 5, 2011 at 5:33 PM, Tommy Chheng tommy.chheng@gmail.com wrote:
Thanks, i'll debug this some more. I'm using DBpedia's extraction code.
val md5 = (new BigInteger(1, messageDigest)).toString(16)
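It's eating the leading zero when you convert the digest to a BigInteger: toString(16) prints the value as a number, so leading zero digits are dropped. Not sure why the code goes that route, when AFAIK MessageDigest provides a toString() method.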
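To make the diagnosis above concrete, here is a minimal sketch of the difference between going through BigInteger and formatting each byte on its own. It uses a made-up 4-byte array rather than a real MD5 digest, and the object and method names are hypothetical, not taken from DBpedia's code. As far as I can tell, MessageDigest's toString() describes the digest object rather than returning the hex string, so the bytes still have to be converted by hand one way or another.

```scala
import java.math.BigInteger

object LeadingZeroDemo {
  // Numeric conversion: the bytes are read as one big number,
  // so any leading zero hex digits are not printed.
  def hexViaBigInteger(digest: Array[Byte]): String =
    new BigInteger(1, digest).toString(16)

  // Per-byte conversion: every byte always yields exactly two hex digits,
  // so leading zeros are preserved.
  def hexPerByte(digest: Array[Byte]): String =
    digest.map(b => "%02x".format(b & 0xff)).mkString
}

object LeadingZeroDemoMain extends App {
  // Made-up sample bytes (not a real digest); the first byte is 0x02.
  val sample = Array(0x02, 0x6f, 0xdc, 0x3c).map(_.toByte)
  println(LeadingZeroDemo.hexViaBigInteger(sample)) // 26fdc3c
  println(LeadingZeroDemo.hexPerByte(sample))       // 026fdc3c
}
```

With the numeric route the first character becomes "2" instead of "0", which matches the 2/26/... versus 0/02/... mismatch reported for Stewie_Griffin.png.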
Thanks, here's a fix for when the leading zero is being eaten:
val md5 = if (result.length % 2 != 0) "0" + result else result
Have you considered how your code will perform if the MD5 checksum has *two* leading zeroes?
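A small sketch of why that question matters, assuming the hex string comes from BigInteger as earlier in the thread. The parity check only notices an odd number of missing digits, whereas padding to the full 32 characters of an MD5 hex digest covers any number of them. The names padByParity and padToWidth are hypothetical, chosen only for this illustration.

```scala
object PaddingDemo {
  // The parity-based fix from earlier in the thread:
  // adds one "0" only when the length is odd.
  def padByParity(hex: String): String =
    if (hex.length % 2 != 0) "0" + hex else hex

  // Left-pad with zeros up to a fixed width (32 hex digits for MD5).
  def padToWidth(hex: String, width: Int = 32): String =
    ("0" * math.max(0, width - hex.length)) + hex
}

object PaddingDemoMain extends App {
  // Made-up 30-character string standing in for a digest that lost "00".
  val truncated = "ab" * 15
  println(PaddingDemo.padByParity(truncated).length) // 30, still too short
  println(PaddingDemo.padToWidth(truncated).length)  // 32
}
```

A digest that loses two leading zeros has an even length, so the parity check leaves it untouched and the first path segments are still wrong.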
Actually, I did run into that bug immediately after. I ended up using org.apache.commons.codec.digest.DigestUtils#md5Hex from Apache Commons Codec, which does a more careful job. Here are the two test cases I used:
val stewie = getUrlHashPart("Stewie_Griffin.png")
assertEquals("0/02/Stewie_Griffin.png", stewie)
val batman = getUrlHashPart("Batman_Kane.jpg")
assertEquals("0/00/Batman_Kane.jpg", batman)
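For reference, a rough sketch of what a getUrlHashPart along these lines might look like. The real function body is not shown in the thread, so this is a guess at its shape, and it swaps in plain JDK calls with per-byte hex formatting instead of the Commons Codec md5Hex mentioned above, to keep the example dependency-free. It also assumes the file name is UTF-8 and already uses underscores rather than spaces.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object ImageUrlHash {
  // Full 32-character lowercase hex MD5, leading zeros included.
  def md5Hex(s: String): String = {
    val digest = MessageDigest.getInstance("MD5")
      .digest(s.getBytes(StandardCharsets.UTF_8))
    digest.map(b => "%02x".format(b & 0xff)).mkString
  }

  // Hypothetical reconstruction: <first hex char>/<first two hex chars>/<file name>.
  def getUrlHashPart(fileName: String): String = {
    val md5 = md5Hex(fileName)
    s"${md5.substring(0, 1)}/${md5.substring(0, 2)}/$fileName"
  }
}

object ImageUrlHashMain extends App {
  println(ImageUrlHash.getUrlHashPart("Stewie_Griffin.png")) // 0/02/Stewie_Griffin.png
}
```

The expected output for Stewie_Griffin.png follows from the md5sum result posted earlier in the thread (026fdc3c...).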
On Wed, Dec 7, 2011 at 1:34 PM, Mark Wagner carnildo@gmail.com wrote:
Have you considered how your code will perform if the MD5 checksum has *two* leading zeroes?
-- Mark
You may be more future-proof by asking the API for the image URL, rather than trying to figure it out yourself, as each wiki install may have other factors that determine the directory/hash structure (I've seen places that have 3 levels, not 2).
http://en.wikipedia.org/w/api.php?action=query&prop=imageinfo&iiprop...
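As a sketch of that approach, the snippet below only builds a query URL and does not perform the request. The link above is truncated in the archive, so the parameters here are not copied from it; they are my assumption of a typical imageinfo query (iiprop=url, format=json), and the object and method names are hypothetical.

```scala
import java.net.URLEncoder

object ImageInfoQuery {
  // Builds an imageinfo query URL for one file title.
  // apiBase and the parameter set are assumptions for illustration.
  def imageInfoUrl(fileTitle: String,
                   apiBase: String = "http://en.wikipedia.org/w/api.php"): String = {
    val params = Seq(
      "action" -> "query",
      "prop"   -> "imageinfo",
      "iiprop" -> "url",
      "format" -> "json",
      "titles" -> s"File:$fileTitle"
    )
    // URL-encode each value so titles with special characters stay valid.
    val query = params
      .map { case (k, v) => s"$k=${URLEncoder.encode(v, "UTF-8")}" }
      .mkString("&")
    s"$apiBase?$query"
  }
}

object ImageInfoQueryMain extends App {
  println(ImageInfoQuery.imageInfoUrl("Stewie_Griffin.png"))
}
```

The JSON response should then carry the full image URL under the page's imageinfo entry, so no assumption about the hash directory layout is needed on the client side.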
mediawiki-api@lists.wikimedia.org