I ran some benchmarks on one of the WMF machines. The input I used is a 137.5 MB (144,220,582 bytes) OGV file that someone asked me to upload to Commons recently. For each benchmark, I hashed the file 25 times and computed the average running time.
MD5: 393 ms
SHA-1: 404 ms
SHA-256: 1281 ms
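For anyone who wants to repeat this kind of measurement, here is a rough sketch in Python of the procedure described above: hash the same input 25 times per algorithm and average the wall-clock time. This is not the script that produced the numbers above. The helper names (`time_hash`, `benchmark`) are made up for illustration, and the demo uses a buffer of zero bytes as a stand-in for the OGV file, so absolute timings will differ.

```python
import hashlib
import time


def time_hash(algorithm, data, runs=25):
    """Return the mean time in milliseconds to hash `data` with `algorithm`."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        hashlib.new(algorithm, data).hexdigest()
        total += time.perf_counter() - start
    return total / runs * 1000.0


def benchmark(data, algorithms=("md5", "sha1", "sha256"), runs=25):
    """Map each algorithm name to its mean hashing time in milliseconds."""
    return {name: time_hash(name, data, runs) for name in algorithms}


if __name__ == "__main__":
    # Assumption: 8 MiB of zero bytes stands in for the real 137.5 MB file.
    sample = bytes(8 * 1024 * 1024)
    for name, ms in benchmark(sample).items():
        print(f"{name}: {ms:.1f} ms")
```

The data is held in memory so that disk I/O is not part of the measurement; whether the original benchmark did the same is not stated.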
Can we keep some perspective please? MD5 is plenty good enough for the purposes discussed here. It's fast and, almost as important, it is well supported by many OSs, libraries, etc. As far as collisions go, there are plenty of easy solutions, such as:
* Check for a collision before allowing a new revision, and do something if one is found (to handle the pre-image attack)
* When reverting, do a select count(*) where md5=? and then do something more advanced when more than one match is found
* Use the checksum to find the revision fast, but still do a full byte comparison.
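The last option above (checksum as a fast index, full byte comparison as the final word) could look roughly like the sketch below. It is only an illustration in Python with an in-memory dict standing in for the database index; `RevisionStore` and its methods are hypothetical names, not anything in MediaWiki.

```python
import hashlib
from collections import defaultdict


class RevisionStore:
    """Toy in-memory store: MD5 narrows the search, bytes decide equality."""

    def __init__(self):
        self.texts = {}                  # rev_id -> revision text (bytes)
        self.by_md5 = defaultdict(list)  # hex MD5 -> list of rev_ids

    def add(self, rev_id, text):
        self.texts[rev_id] = text
        self.by_md5[hashlib.md5(text).hexdigest()].append(rev_id)

    def find_identical(self, text):
        """Return rev_ids whose stored text is byte-for-byte equal to `text`."""
        digest = hashlib.md5(text).hexdigest()
        candidates = self.by_md5.get(digest, [])
        # A checksum match is only a hint; confirm with a full comparison,
        # so a colliding revision can never be mistaken for a real match.
        return [rev_id for rev_id in candidates if self.texts[rev_id] == text]
```

With this shape, a forged revision that shares a checksum with an existing one costs an extra comparison but cannot be returned as an identical revision.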
I've only seen one real attack scenario mentioned in this thread - that of someone creating a new page with the same checksum as an existing one, for purposes of messing up the reversion system. Are there other attacks we should worry about?
I'm also of the opinion that we should just store things as CHAR(32), unless someone thinks space is really at that much of a premium. The big advantage of 32 chars (i.e. 0-9a-f, aka hexadecimal) is that it's a standard way to represent things, making use of common tools (e.g. md5sum) much easier.
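To make the space trade-off concrete, here is a small Python sketch comparing the two representations of the same MD5 digest: the 32-character hex string (the form md5sum prints) and the raw 16-byte value. The helper names `md5_hex` and `md5_raw` are just for illustration.

```python
import hashlib


def md5_hex(data):
    """32 lowercase hex characters, the same form md5sum prints."""
    return hashlib.md5(data).hexdigest()


def md5_raw(data):
    """The same digest as 16 raw bytes."""
    return hashlib.md5(data).digest()


text = b"hello world\n"
hex_form = md5_hex(text)
raw_form = md5_raw(text)

print(len(hex_form), len(raw_form))         # 32 16
print(bytes.fromhex(hex_form) == raw_form)  # True
```

So the hex form costs 16 extra bytes per row, and in exchange the stored value can be compared directly against the output of standard tools without any conversion.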