Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table - Wikitech-l

19 Sep 2011

...
  > I ran some benchmarks on one of the WMF machines.
The input I used is
 > a 137.5 MB (144,220,582 bytes) OGV file that someone asked me to
 > upload to Commons recently. For each benchmark, I hashed the file 25
 > times and computed the average running time.
 >
 > MD5: 393 ms
 > SHA-1: 404 ms
 > SHA-256: 1281 ms 
Can we keep some perspective please? MD5 is plenty good enough for the 
purposes discussed here. It's fast, and almost as important, is easily 
supported by many OSs, libraries, etc. As far as collisions, there are 
plenty of easy solutions, such as:

* Check for a collision before allowing a new revision, and do something 
if so (to handle the pre-image attack)

* When reverting, do a select count(*) where md5=? and then do something 
more advanced when more than one match is found

* Use the checksum to find the revision fast, but still do a full byte 
comparison.

I've only seen one real attack scenario mentioned in this thread - 
that of someone creating a new page with the same checksum as an existing 
one, for purposes of messing up the reversion system. Are there other 
attacks we should worry about?

I'm also of the opinion that we should just store things as CHAR(32), 
unless someone thinks space is really at that much of a premium. The big 
advantage of 32 chars (i.e. 0-9a-f aka hexadecimal ) is that it's a 
standard way to represent things, making use of common tools (e.g. md5sum)
much easier.

-- 
Greg Sabino Mullane greg(a)endpoint.com
End Point Corporation
PGP Key: 0x14964AC8