On Sun, Sep 18, 2011 at 11:00 PM, Anthony wikimail@inbox.org wrote:
> Now I don't know how important the CPU differences in calculating the two versions would be. If they're significant enough, then fine, use MD5, but make sure there are warnings all over the place about its use.
I ran some benchmarks on one of the WMF machines. The input I used is a 137.5 MB (144,220,582 bytes) OGV file that someone asked me to upload to Commons recently. For each benchmark, I hashed the file 25 times and computed the average running time.
MD5: 393 ms
SHA-1: 404 ms
SHA-256: 1281 ms
Note that the input size is many times higher than $wgMaxArticleSize, which is set to 2000 KB at WMF. For historical reasons, we have some revisions in our history that are larger; Ariel would be able to tell you how large, but I believe nothing in there is larger than 10 MB. So I decided to run the numbers for more realistic sizes as well, using the first 2 MB and 10 MB, respectively, of my OGV file.
For 2 MB (averages of 1000 runs):
MD5: 5.66 ms
SHA-1: 5.85 ms
SHA-256: 18.56 ms
For 10 MB (averages of 200 runs):
MD5: 28.6 ms
SHA-1: 29.47 ms
SHA-256: 93.49 ms
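For anyone who wants to reproduce this, the methodology can be sketched in Python with hashlib (a hypothetical re-implementation; the original numbers were measured on a WMF machine with different tooling, so absolute timings will differ):

```python
import hashlib
import time

def bench(algo: str, data: bytes, runs: int) -> float:
    """Average wall-clock time in ms to hash `data` with `algo` over `runs` runs."""
    start = time.perf_counter()
    for _ in range(runs):
        hashlib.new(algo, data).digest()
    return (time.perf_counter() - start) / runs * 1000.0

# 2 MB of dummy input, standing in for the first 2 MB of the OGV file
data = b"x" * (2 * 1024 * 1024)
for algo in ("md5", "sha1", "sha256"):
    print(f"{algo}: {bench(algo, data, 100):.2f} ms")
```

The relative ordering (MD5 ~ SHA-1 << SHA-256) should come out the same on most hardware, even though the absolute numbers depend on the machine.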
So yes, SHA-256 is a few times (just over 3x) more expensive to compute than SHA-1, which in turn is only a few percent slower than MD5. However, on the largest size we allow for new revisions, it takes < 20 ms. That sounds like an acceptable worst case for on-the-fly population, since saves and parses are slow anyway, especially for 2 MB of wikitext. The 10 MB case is only relevant for backfilling, which we could do from a maintenance script, and < 100 ms is definitely acceptable there.
Roan Kattouw (Catrope)