On Sun, Sep 18, 2011 at 1:55 AM, Robert Rohde rarohde@gmail.com wrote:
If collision attacks really matter we should use SHA-1.
If collision attacks really matter you should use, at least, SHA-256, no?
However, do any of the proposed use cases care about whether someone might intentionally inject a collision? In the proposed uses I've looked at, it seems irrelevant. The intentional collision will get flagged as a revert and the text leading to that collision would be discarded. How is that a bad thing?
Well, what if the checksum of the initial page hasn't been calculated yet? Then some miscreant sets the page to spam whose checksum collides with it, and then the spam gets reverted. The good page would be the one that gets thrown out.
Maybe that's not feasible. Maybe it is. Either way, I'd feel very uncomfortable about the fact that someday someone might decide to use the checksums in some way in which collisions would matter.
Now I don't know how important the CPU differences in calculating the two versions would be. If they're significant enough, then fine, use MD5, but make sure there are warnings all over the place about its use.
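For what it's worth, the CPU question is easy to measure rather than guess at. Here is a rough sketch using only Python's standard hashlib and timeit modules. The function name time_hashes, the 100 KB sample size, and the repeat count are my own assumptions for illustration, not anything taken from MediaWiki, and the numbers will vary by machine and OpenSSL build.

```python
import hashlib
import timeit


def time_hashes(data, algorithms=("md5", "sha1", "sha256"), repeats=200):
    """Return the average seconds per digest of `data` for each algorithm."""
    results = {}
    for name in algorithms:
        # timeit runs the callable `repeats` times and returns total seconds.
        total = timeit.timeit(
            lambda: hashlib.new(name, data).digest(), number=repeats
        )
        results[name] = total / repeats
    return results


# Hypothetical sample: 100 KB of text standing in for one large revision.
sample = b"x" * 100_000
for name, seconds in time_hashes(sample).items():
    print(f"{name}: {seconds * 1e6:.1f} microseconds per revision")
```

Multiplying the per-revision figure by the number of revisions gives a ballpark for a full recomputation, which should be enough to tell whether the difference is significant.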
(As another possibility, what if someone writes a bot to detect certain reverts? I can see spammers/vandals having a field day with this sort of thing.)
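To make the concern concrete, here is a minimal sketch of checksum-based revert detection as I understand the proposal. The function find_reverts and the sample history are hypothetical. The point is that whichever text is hashed first is treated as the original, and any later text with the same digest is flagged as a revert to it, whether it is a genuine revert or an engineered collision.

```python
import hashlib


def find_reverts(revisions, algorithm="sha256"):
    """Return (rev_id, earlier_rev_id) pairs whose text digests match.

    `revisions` is an iterable of (rev_id, text) in chronological order.
    """
    first_seen = {}  # digest -> id of the first revision with that digest
    reverts = []
    for rev_id, text in revisions:
        digest = hashlib.new(algorithm, text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            # Same digest as an earlier revision: treated as a revert to it.
            reverts.append((rev_id, first_seen[digest]))
        else:
            first_seen[digest] = rev_id
    return reverts


history = [(1, "good page"), (2, "vandalism"), (3, "good page")]
print(find_reverts(history))  # [(3, 1)]
```

A bot built on something like this trusts the digest completely, which is why the choice of hash function matters more than it looks.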
For offline analyses, there's no need to change the online database tables.
Need? That's debatable, but one of the major motivators is the desire to have hash values in database dumps (both for revert checks and for checksums on correct data import / export). Both of those are "offline" uses, but it is beneficial to have that information precomputed and stored rather than frequently regenerated.
Why not in a separate file? There's no need to get permission from anyone or mess with the schema to generate a file with revision IDs and checksums. If WMF won't host it at the regular dump location (though I can't see why they wouldn't), you could host it at archive.org.
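A sketch of what I mean, assuming you already have some way of iterating over (revision ID, text) pairs from a dump. The write_checksums function, the tab-separated layout, and the sample revisions are all hypothetical. Parsing the dump itself is left out.

```python
import csv
import hashlib
import io


def write_checksums(revisions, out, algorithm="sha256"):
    """Write one 'rev_id<TAB>hex digest' row per revision; return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["rev_id", algorithm])  # header names the hash used
    count = 0
    for rev_id, text in revisions:
        digest = hashlib.new(algorithm, text.encode("utf-8")).hexdigest()
        writer.writerow([rev_id, digest])
        count += 1
    return count


# Hypothetical revisions; a real run would stream them from a dump file.
buf = io.StringIO()
write_checksums([(101, "first text"), (102, "second text")], buf)
print(buf.getvalue())
```

Putting the algorithm name in the header means the file can later be regenerated with a different hash without confusing anyone who consumes it.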