On Sun, Sep 18, 2011 at 1:55 AM, Robert Rohde rarohde@gmail.com wrote:
If collision attacks really matter we should use SHA-1.
If collision attacks really matter you should use, at least, SHA-256, no?
However, do any of the proposed use cases care about whether someone might intentionally inject a collision? In the proposed uses I've looked at, it seems irrelevant. The intentional collision will get flagged as a revert and the text leading to that collision would be discarded. How is that a bad thing?
Well, what if the checksum of the initial page hasn't been calculated yet? Then some miscreant sets the page to spam which collides, and then the spam gets reverted. The good page would be the one that gets thrown out.
Maybe that's not feasible. Maybe it is. Either way, I'd feel very uncomfortable about the fact that someday someone might decide to use the checksums in some way in which collisions would matter.
Now I don't know how important the CPU differences in calculating the two versions would be. If they're significant enough, then fine, use MD5, but make sure there are warnings all over the place about its use.
(As another possibility, what if someone writes a bot to detect certain reverts? I can see spammers/vandals having a field day with this sort of thing.)
For offline analyses, there's no need to change the online database tables.
Need? That's debatable, but one of the major motivators is the desire to have hash values in database dumps (both for revert detection and for verifying correct data import/export). Both of those are "offline" uses, but it is beneficial to have that information precomputed and stored rather than frequently regenerated.
Why not in a separate file? There's no need to get permission from anyone or mess with the schema to generate a file with revision ids and checksums. If WMF won't host it at the regular dump location (which I can't see why they wouldn't), you could host it at archive.org.
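A sketch of how such a standalone checksum file could be generated, in Python rather than MediaWiki's PHP. The `revisions` iterable of (revision id, text) pairs is a stand-in here; producing it from an actual XML dump would require a real dump parser.

```python
import hashlib

def write_checksum_file(revisions, out_path, algo="sha256"):
    """Write one 'rev_id<TAB>hexdigest' line per revision.

    `revisions` is any iterable of (rev_id, text) pairs; building it
    from a MediaWiki XML dump is left to a real dump parser.
    """
    with open(out_path, "w") as out:
        for rev_id, text in revisions:
            digest = hashlib.new(algo, text.encode("utf-8")).hexdigest()
            out.write(f"{rev_id}\t{digest}\n")

# Inline sample data in place of a real dump:
write_checksum_file([(1, "Hello"), (2, "Hello, world")], "checksums.tsv")
```

A file like this can be regenerated or published independently of the live database, which is the point being made above.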
On Sun, Sep 18, 2011 at 11:00 PM, Anthony wikimail@inbox.org wrote:
Now I don't know how important the CPU differences in calculating the two versions would be. If they're significant enough, then fine, use MD5, but make sure there are warnings all over the place about its use.
I ran some benchmarks on one of the WMF machines. The input I used is a 137.5 MB (144,220,582 bytes) OGV file that someone asked me to upload to Commons recently. For each benchmark, I hashed the file 25 times and computed the average running time.
MD5:     393 ms
SHA-1:   404 ms
SHA-256: 1281 ms
Note that the input size is many times higher than $wgMaxArticleSize, which is set to 2000 KB at WMF. For historical reasons, we have some revisions in our history that are larger; Ariel would be able to tell you how large, but I believe nothing in there is larger than 10 MB. So I decided to run the numbers for more realistic sizes as well, using the first 2 MB and 10 MB, respectively, of my OGV file.
For 2 MB (averages of 1000 runs):
MD5:     5.66 ms
SHA-1:   5.85 ms
SHA-256: 18.56 ms
For 10 MB (averages of 200 runs):
MD5:     28.6 ms
SHA-1:   29.47 ms
SHA-256: 93.49 ms
So yes, SHA-256 is a few times (just over 3x) more expensive to compute than SHA-1, which in turn is only a few percent slower than MD5. However, on the largest possible size we allow for new revisions it takes < 20ms. It sounds like that's an acceptable worst case for on-the-fly population, since saves and parses are slow anyway, especially for 2 MB of wikitext. The 10 MB case is only relevant for backfilling, which we could do from a maintenance script, and < 100ms is definitely acceptable there.
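For anyone who wants to reproduce numbers like these, here is a minimal benchmark harness in Python (not the exact script Roan used; absolute timings will differ by machine, but the relative ordering should hold):

```python
import hashlib
import os
import time

def bench(algo, data, runs=50):
    """Return the average wall-clock time in ms to hash `data` with `algo`."""
    start = time.perf_counter()
    for _ in range(runs):
        hashlib.new(algo, data).digest()
    return (time.perf_counter() - start) / runs * 1000

data = os.urandom(2 * 1024 * 1024)  # 2 MB of random bytes, ~max article size
for algo in ("md5", "sha1", "sha256"):
    print(f"{algo}: {bench(algo, data):.2f} ms")
```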
Roan Kattouw (Catrope)
On Sun, Sep 18, 2011 at 6:00 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
On Sun, Sep 18, 2011 at 11:00 PM, Anthony wikimail@inbox.org wrote:
Now I don't know how important the CPU differences in calculating the two versions would be. If they're significant enough, then fine, use MD5, but make sure there are warnings all over the place about its use.
I ran some benchmarks on one of the WMF machines. The input I used is a 137.5 MB (144,220,582 bytes) OGV file that someone asked me to upload to Commons recently. For each benchmark, I hashed the file 25 times and computed the average running time.
MD5: 393 ms SHA-1: 404 ms SHA-256: 1281 ms
Did you try any of the non-secure hash functions? If you're going to go with MD5, might as well go with the significantly faster CRC-64.
If you're just using it to detect reverts, then you can run the CRC-64 check first, and then confirm with a check of the entire message.
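The two-stage idea sketched above could look something like this. Python's standard library exposes CRC-32 (`zlib.crc32`) rather than CRC-64, so that is used as the cheap filter here; the structure is the same either way: fast non-cryptographic lookup first, full text comparison to rule out collisions second.

```python
import zlib

def find_revert_target(new_text, old_texts_by_crc):
    """Return the old text that `new_text` reverts to, or None.

    `old_texts_by_crc` maps crc32 -> list of full old texts. The CRC
    lookup is the fast filter; the equality check confirms a real match
    and discards accidental (or deliberate) CRC collisions.
    """
    data = new_text.encode("utf-8")
    for old in old_texts_by_crc.get(zlib.crc32(data), []):
        if old == new_text:  # confirm with a full comparison
            return old
    return None

# Build the index from page history:
history = {}
for old in ["Good version", "Spam spam spam"]:
    history.setdefault(zlib.crc32(old.encode("utf-8")), []).append(old)
```

The full comparison in the second stage is what makes the choice of hash function irrelevant to correctness here; the hash only affects how often the expensive comparison runs.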
On Mon, Sep 19, 2011 at 12:12 AM, Anthony wikimail@inbox.org wrote:
Did you try any of the non-secure hash functions? If you're going to go with MD5, might as well go with the significantly faster CRC-64.
I included MD5 because MediaWiki currently uses it for some things, and SHA-1 because it had been suggested in this discussion. I didn't feel the need to include anything non-cryptographic because points have been made that choosing a cryptographic hash function would be wise (because the feature might be used for something different later, among other things) and worries were expressed that SHA-256 might be too slow. I think these benchmarks show that that slowness is not a real problem, so I think we should pick the right tool for the job rather than try to pick the fastest hash function. It wasn't a contest, just a test to see whether SHA-256 was within the realm of feasibility, performance-wise.
Roan
On 18/09/11 23:38, Roan Kattouw wrote:
On Mon, Sep 19, 2011 at 12:12 AM, Anthonywikimail@inbox.org wrote:
Did you try any of the non-secure hash functions? If you're going to go with MD5, might as well go with the significantly faster CRC-64.
I included MD5 because MediaWiki currently uses it for some things, and SHA-1 because it had been suggested in this discussion. I didn't feel the need to include anything non-cryptographic because points have been made that choosing a cryptographic hash function would be wise (because the feature might be used for something different later, among other things) and worries were expressed that SHA-256 might be too slow. I think these benchmarks show that that slowness is not a real problem, so I think we should pick the right tool for the job rather than try to pick the fastest hash function. It wasn't a contest, just a test to see whether SHA-256 was within the realm of feasibility, performance-wise.
Roan
This post by Bruce Schneier on the subject of hash collision/preimage attacks has some resources on the state-of-the-art of this as of 2009 -- people have been worrying about this for some years.
http://www.schneier.com/blog/archives/2009/06/ever_better_cry.html
As far as I know, there are no viable _preimage_ attacks available now even for MD5, let alone SHA-1. But given the increasing value of keeping Wikipedia's assets secure, and given that the computational cost of a better hash becomes ever more trivial as time progresses, it would be prudent to make plans now for eventually upgrading to a better hash function.
In an ideal world, both SHA-1 and SHA-2 could be supported by MediaWiki, with the choice of the hash being configurable by the site administrator, and the chosen hash algorithm name being stored with the hash value itself in the database, allowing smooth transition from the old to the new once the administrator decides to flip the switch to use the better hash. In the short run, the default could be set to SHA-1 for performance reasons, and SHA-2 used only by the paranoid; in the longer run, new releases of MediaWiki could be shipped with SHA-2 as the default.
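One common way to store the algorithm name alongside the value is an "algo:hexdigest" prefix. The sketch below illustrates the idea in Python; it is not MediaWiki's actual schema or API, just a minimal model of the transition Neil describes: old rows keep verifying under their original algorithm while new rows use the configured default.

```python
import hashlib

def make_checksum(text, algo="sha1"):
    """Store the algorithm name with the digest, e.g. 'sha1:356a19...'."""
    return f"{algo}:{hashlib.new(algo, text.encode('utf-8')).hexdigest()}"

def verify_checksum(text, stored):
    """Verify against whichever algorithm the stored value was made with."""
    algo, _, digest = stored.partition(":")
    return hashlib.new(algo, text.encode("utf-8")).hexdigest() == digest
```

Flipping the site default then only changes what `make_checksum` writes for new revisions; `verify_checksum` never needs to know which era a row came from.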
-- Neil
wikitech-l@lists.wikimedia.org