> I ran some benchmarks on one of the WMF machines. The input I used is
> a 137.5 MB (144,220,582 bytes) OGV file that someone asked me to
> upload to Commons recently. For each benchmark, I hashed the file 25
> times and computed the average running time.
>
> MD5: 393 ms
> SHA-1: 404 ms
> SHA-256: 1281 ms
Can we keep some perspective please? MD5 is plenty good enough for the
purposes discussed here. It's fast and, almost as important, well
supported by many OSs, libraries, etc. As for collisions, there are
plenty of easy solutions, such as:
* Check for a collision before allowing a new revision, and do something
if so (to handle the pre-image attack)
* When reverting, do a select count(*) where md5=? and then do something
more advanced when more than one match is found
* Use the checksum to find the revision fast, but still do a full byte
comparison.
I've only seen one real attack scenario mentioned in this thread -
that of someone creating a new page with the same checksum as an existing
one, for purposes of messing up the reversion system. Are there other
attacks we should worry about?
I'm also of the opinion that we should just store things as CHAR(32),
unless someone thinks space is really at that much of a premium. The big
advantage of 32 chars (i.e. 0-9a-f, aka hexadecimal) is that it's a
standard way to represent things, making use of common tools (e.g. md5sum)
much easier.
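
As a rough illustration of the tradeoff, the sketch below (Python, with
a hypothetical helper name) shows that the hex form is the same
32-character string md5sum prints, while the raw digest is 16 bytes,
which is the space a binary column would save per row.

```python
import hashlib


def md5_hex_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large uploads need not fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


digest = hashlib.md5(b"hello")
hex_form = digest.hexdigest()  # what would go in a CHAR(32) column
raw_form = digest.digest()     # what a 16-byte binary column would hold

print(hex_form, len(hex_form), len(raw_form))
```

The value returned by md5_hex_file can be compared directly against the
first field of md5sum's output for the same file.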
--
Greg Sabino Mullane greg(a)endpoint.com
End Point Corporation
PGP Key: 0x14964AC8