On Sun, Sep 18, 2011 at 11:00 PM, Anthony <wikimail(a)inbox.org> wrote:
> Now I don't know how important the CPU differences in calculating the
> two versions would be. If they're significant enough, then fine, use
> MD5, but make sure there are warnings all over the place about its
> use.
I ran some benchmarks on one of the WMF machines. The input I used is
a 137.5 MB (144,220,582 bytes) OGV file that someone asked me to
upload to Commons recently. For each benchmark, I hashed the file 25
times and computed the average running time.
MD5: 393 ms
SHA-1: 404 ms
SHA-256: 1281 ms
Note that the input size is many times larger than $wgMaxArticleSize,
which is set to 2000 KB at WMF. For historical reasons, we have some
revisions in our history that are larger; Ariel would be able to tell
you how large, but I believe nothing in there is larger than 10 MB. So
I decided to run the numbers for more realistic sizes as well, using
the first 2 MB and 10 MB, respectively, of my OGV file.
For 2 MB (averages of 1000 runs):
MD5: 5.66 ms
SHA-1: 5.85 ms
SHA-256: 18.56 ms
For 10 MB (averages of 200 runs):
MD5: 28.6 ms
SHA-1: 29.47 ms
SHA-256: 93.49 ms
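For anyone who wants to repeat this kind of measurement, here is a
minimal sketch of the method described above (hash the same input N
times, report the mean). It is not the script used for the numbers in
this mail; it assumes Python's hashlib, and the function names and the
zero-filled stand-in input are hypothetical. Absolute timings will
differ by machine and implementation.

```python
import hashlib
import time


def average_hash_ms(data: bytes, algorithm: str, runs: int) -> float:
    """Hash `data` `runs` times and return the mean wall-clock time in ms."""
    start = time.perf_counter()
    for _ in range(runs):
        hashlib.new(algorithm, data).digest()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs


def benchmark(data: bytes, runs: int,
              algorithms=("md5", "sha1", "sha256")) -> dict:
    """Return {algorithm name: mean ms per hash} for the given input."""
    return {name: average_hash_ms(data, name, runs) for name in algorithms}


if __name__ == "__main__":
    # Stand-in input (assumption): 2000 KB of zero bytes. To mirror the
    # mail, read the first N bytes of a real file instead, e.g.
    #   with open(path, "rb") as f: data = f.read(2000 * 1024)
    data = bytes(2000 * 1024)
    for name, ms in benchmark(data, runs=50).items():
        print(f"{name}: {ms:.2f} ms")
```

The data is held in memory before timing starts, so disk I/O is not
part of the measurement.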
So yes, SHA-256 is just over 3x more expensive to compute than SHA-1,
which in turn is only a few percent (about 3%) slower than MD5.
However, at the largest size we allow for new revisions (2 MB),
SHA-256 takes < 20 ms. That sounds like an acceptable worst case for
on-the-fly population, since saves and parses are slow anyway,
especially for 2 MB of wikitext. The 10 MB case is only relevant for
backfilling, which we could do from a maintenance script, and < 100 ms
is definitely acceptable there.
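The ratios quoted above can be checked directly against the reported
averages. A small sketch, with the timings copied from this mail; the
dictionary layout and the `ratios` helper are only illustrative:

```python
# Mean timings in milliseconds, as reported in this mail.
timings_ms = {
    "137.5 MB": {"md5": 393.0, "sha1": 404.0, "sha256": 1281.0},
    "2 MB": {"md5": 5.66, "sha1": 5.85, "sha256": 18.56},
    "10 MB": {"md5": 28.6, "sha1": 29.47, "sha256": 93.49},
}


def ratios(t: dict) -> dict:
    """Relative cost of SHA-256 vs SHA-1, and SHA-1 overhead vs MD5 in %."""
    return {
        "sha256_vs_sha1": t["sha256"] / t["sha1"],
        "sha1_vs_md5_pct": (t["sha1"] / t["md5"] - 1.0) * 100.0,
    }


if __name__ == "__main__":
    for size, t in timings_ms.items():
        r = ratios(t)
        print(f"{size}: SHA-256 is {r['sha256_vs_sha1']:.2f}x SHA-1; "
              f"SHA-1 is {r['sha1_vs_md5_pct']:.1f}% slower than MD5")
```

The relative costs stay essentially constant across the three input
sizes, which is what you'd expect if hashing time scales linearly with
input length.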
Roan Kattouw (Catrope)