I ran some benchmarks on one of the WMF machines. The input I used is a 137.5 MB (144,220,582 bytes) OGV file that someone asked me to upload to Commons recently. For each benchmark, I hashed the file 25 times and computed the average running time.
MD5: 393 ms
SHA-1: 404 ms
SHA-256: 1281 ms
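For anyone wanting to reproduce this, a minimal sketch of such a timing run; the file path and iteration count are placeholders rather than the exact script used:

<?php
// Rough benchmark: hash one large file repeatedly with each algorithm and
// report the average wall-clock time per run.
$file = '/tmp/sample.ogv'; // placeholder path for the test file
$runs = 25;

foreach ( array( 'md5', 'sha1', 'sha256' ) as $algo ) {
    $start = microtime( true );
    for ( $i = 0; $i < $runs; $i++ ) {
        hash_file( $algo, $file );
    }
    printf( "%s: %.0f ms\n", $algo, ( microtime( true ) - $start ) / $runs * 1000 );
}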
Can we keep some perspective please? MD5 is plenty good enough for the purposes discussed here. It's fast, and almost as important, is easily supported by many OSs, libraries, etc. As far as collisions, there are plenty of easy solutions, such as:
* Check for a collision before allowing a new revision, and do something if so (to handle the pre-image attack)
* When reverting, do a select count(*) where md5=? and then do something more advanced when more than one match is found (see the sketch after this list)
* Use the checksum to find the revision fast, but still do a full byte comparison.
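A minimal sketch of the second item above (count matching checksums, then fall back to something smarter); it uses the MediaWiki database helpers of that era, and the rev_md5 column is hypothetical, not something core actually has:

<?php
// Sketch only: find the target revision of a revert via its checksum, but fall
// back to more careful handling when the checksum is ambiguous.
// rev_md5 is a hypothetical column; rev_page and rev_id are real ones.
function findRevisionByChecksum( $pageId, $md5 ) {
    $dbr = wfGetDB( DB_SLAVE );
    $conds = array( 'rev_page' => $pageId, 'rev_md5' => $md5 );
    $matches = $dbr->selectField( 'revision', 'COUNT(*)', $conds, __METHOD__ );
    if ( $matches == 1 ) {
        // Unambiguous: the checksum alone identifies the revision.
        return $dbr->selectField( 'revision', 'rev_id', $conds, __METHOD__ );
    }
    // Zero or several matches: do something more advanced, e.g. load each
    // candidate's text and compare it byte-for-byte before trusting it.
    return null;
}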
I've only seen one real attack scenario mentioned in this thread - that of someone creating a new page with the same checksum as an existing one, for purposes of messing up the reversion system. Are there other attacks we should worry about?
I'm also of the opinion that we should just store things as CHAR(32), unless someone thinks space is really at that much of a premium. The big advantage of 32 chars (i.e. 0-9a-f, aka hexadecimal) is that it's a standard way to represent things, making use of common tools (e.g. md5sum) much easier.
On Mon, Sep 19, 2011 at 11:11 AM, Greg Sabino Mullane greg@endpoint.com wrote:
I'm also of the opinion that we should just store things as CHAR(32), unless someone thinks space is really at that much of a premium. The big advantage of 32 chars (i.e. 0-9a-f, aka hexadecimal) is that it's a standard way to represent things, making use of common tools (e.g. md5sum) much easier.
This is very sound advice. It also makes debugging much easier, as things like PHPMyAdmin "just work". The math table's binary version was quite annoying to try to debug.
Conrad
On Mon, Sep 19, 2011 at 2:11 PM, Greg Sabino Mullane greg@endpoint.com wrote:
- When reverting, do a select count(*) where md5=? and then do something more advanced when more than one match is found
That sounds very painful :(
-Chad
[snip]
So just FYI -- the only *actual* controversy that needs to be discussed in this thread is "how do we make this update applicable in a way that doesn't disrupt live sites with many millions of pages?"
We're pretty fixed on SHA-1 as a checksum sig (already using it elsewhere) and have no particular desire or need to change or think about alternatives; bikeshedding details of the formatting and storage are not at issue.
-- brion
Since the primary use case here seems to be offline analysis and it may not be of much interest to mediawiki users outside of wmf, can we store the checksums in new tables (i.e. revision_sha1) instead of running large alters, and implement the code to generate checksums on new edits via an extension?
Checksums for most old revs can be generated offline and populated before the extension goes live. Since nothing will be using the new table yet, there'd be no issues with things like gap lock contention on the revision table from mass populating it.
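A rough sketch of what that offline backfill could look like, assuming a hypothetical revision_sha1 side table with rs_rev_id and rs_sha1 columns; the MediaWiki calls are the ones available around 1.17/1.18 and the batching is illustrative only:

<?php
// Illustrative backfill: walk the revision table in primary-key batches, hash
// each revision's text, and insert the result into a hypothetical
// revision_sha1 side table. Not a drop-in maintenance script.
$batchSize = 1000;
$lastId = 0;
$dbw = wfGetDB( DB_MASTER );

do {
    $res = $dbw->select(
        'revision',
        'rev_id',
        array( 'rev_id > ' . intval( $lastId ) ),
        __METHOD__,
        array( 'ORDER BY' => 'rev_id', 'LIMIT' => $batchSize )
    );
    $rows = array();
    foreach ( $res as $row ) {
        $rev = Revision::newFromId( $row->rev_id );
        if ( $rev ) {
            // Revision::RAW skips audience checks; text comes from External Storage if needed.
            $rows[] = array(
                'rs_rev_id' => $row->rev_id,
                'rs_sha1' => sha1( $rev->getText( Revision::RAW ) ),
            );
        }
        $lastId = $row->rev_id;
    }
    if ( $rows ) {
        $dbw->insert( 'revision_sha1', $rows, __METHOD__, array( 'IGNORE' ) );
    }
    wfWaitForSlaves();
} while ( $res->numRows() === $batchSize );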
On Mon, Sep 19, 2011 at 12:10 PM, Brion Vibber brion@pobox.com wrote:
[snip]
So just FYI -- the only *actual* controversy that needs to be discussed in this thread is "how do we make this update applicable in a way that doesn't disrupt live sites with many millions of pages?"
We're pretty fixed on SHA-1 as a checksum sig (already using it elsewhere) and have no particular desire or need to change or think about alternatives; bikeshedding details of the formatting and storage are not at issue.
-- brion
On Mon, Sep 19, 2011 at 12:53 PM, Asher Feldman afeldman@wikimedia.org wrote:
Since the primary use case here seems to be offline analysis and it may not be of much interest to mediawiki users outside of wmf, can we store the checksums in new tables (i.e. revision_sha1) instead of running large alters, and implement the code to generate checksums on new edits via an extension?
Checksums for most old revs can be generated offline and populated before the extension goes live. Since nothing will be using the new table yet, there'd be no issues with things like gap lock contention on the revision table from mass populating it.
That's probably the simplest solution; adding a new empty table will be very quick. It may make it slower to use the field though, depending on what all uses/exposes it.
During stub dump generation for instance this would need to add a left outer join on the other table, and add things to the dump output (and also needs an update to the XML schema for the dump format). This would then need to be preserved through subsequent dump passes as well.
-- brion
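For illustration, the rough shape that stub query could take with the hypothetical side table pulled in; the LEFT OUTER JOIN keeps revisions that have not been backfilled yet (they just dump with an empty checksum), and the revision_sha1 / rs_* names are assumptions:

<?php
// Sketch of the stub-dump query shape only; everything except the join to the
// hypothetical side table matches the existing page/revision schema.
$sql = "SELECT page_namespace, page_title, page_id,
               rev_id, rev_timestamp, rev_comment,
               rs_sha1
        FROM page
        JOIN revision ON rev_page = page_id
        LEFT OUTER JOIN revision_sha1 ON rs_rev_id = rev_id
        ORDER BY rev_page, rev_id";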
On 11-09-19 12:57 PM, Brion Vibber wrote:
On Mon, Sep 19, 2011 at 12:53 PM, Asher Feldman afeldman@wikimedia.org wrote:
Since the primary use case here seems to be offline analysis and it may not be of much interest to mediawiki users outside of wmf, can we store the checksums in new tables (i.e. revision_sha1) instead of running large alters, and implement the code to generate checksums on new edits via an extension?
Checksums for most old revs can be generated offline and populated before the extension goes live. Since nothing will be using the new table yet, there'd be no issues with things like gap lock contention on the revision table from mass populating it.
That's probably the simplest solution; adding a new empty table will be very quick. It may make it slower to use the field though, depending on what all uses/exposes it.
During stub dump generation for instance this would need to add a left outer join on the other table, and add things to the dump output (and also needs an update to the XML schema for the dump format). This would then need to be preserved through subsequent dump passes as well.
-- brion
Revision is going to need to either make a JOIN whenever it grabs revision info, or make an additional db query whenever someone does use the checksum.
Btw, instead of having Revision return a checksum string and then needing a second method to check what type it is (best to program generically in case we do switch checksum types), how about we return an instance of a simple SHA1 wrapper class? We could have an MD5 one too, and use a simple descriptive method instead of having to manually call wfBaseConvert with the right arguments whenever you want something suitable for filesystem use.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
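A sketch of the kind of value object Daniel describes; the class and method names below are made up for illustration (only wfBaseConvert is a real MediaWiki function), so treat it as a shape rather than an existing API:

<?php
// Hypothetical checksum value object: callers get the hash plus convenience
// accessors for common encodings instead of juggling wfBaseConvert() calls.
class Sha1Checksum {
    protected $hex;

    public function __construct( $hex ) {
        $this->hex = strtolower( $hex );
    }

    public static function ofText( $text ) {
        return new self( sha1( $text ) );
    }

    public function getType() {
        return 'sha1';
    }

    public function toHex() {
        return $this->hex;
    }

    // Base-36 form: shorter, and safe for filesystem or URL use.
    public function toBase36() {
        return wfBaseConvert( $this->hex, 16, 36, 31 );
    }

    public function equals( Sha1Checksum $other ) {
        return $this->hex === $other->hex;
    }
}

An Md5Checksum class could mirror the same interface, so code that only asks for "the checksum in a filesystem-safe form" keeps working if the algorithm ever changes.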
On Mon, Sep 19, 2011 at 3:57 PM, Brion Vibber brion@pobox.com wrote:
That's probably the simplest solution; adding a new empty table will be very quick. It may make it slower to use the field though, depending on what all uses/exposes it.
Isn't adding a new column with all NULL values quick too?
On 11-09-19 06:39 PM, Anthony wrote:
On Mon, Sep 19, 2011 at 3:57 PM, Brion Vibber brion@pobox.com wrote:
That's probably the simplest solution; adding a new empty table will be very quick. It may make it slower to use the field though, depending on what all uses/exposes it.
Isn't adding a new column with all NULL values quick too?
Apparently in InnoDB a table ALTER requires making a full copy of the table. In other words, to add a new column every box doing the alter needs enough room to hold the entire Wikipedia revision table twice. ...InnoDB also doesn't generally like to give disk space back to the operating system when it's done using it.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
On Mon, Sep 19, 2011 at 10:39 PM, Daniel Friesen lists@nadir-seen-fire.com wrote:
On 11-09-19 06:39 PM, Anthony wrote:
On Mon, Sep 19, 2011 at 3:57 PM, Brion Vibber brion@pobox.com wrote:
That's probably the simplest solution; adding a new empty table will be very quick. It may make it slower to use the field though, depending on what all uses/exposes it.
Isn't adding a new column with all NULL values quick too?
Apparently in InnoDB a table ALTER requires making a full copy of the table. In other words, to add a new column every box doing the alter needs enough room to hold the entire Wikipedia revision table twice.
Ah, okay. I remember that's what happened in MyISAM but I figured they had that fixed in InnoDB.
On Mon, Sep 19, 2011 at 3:57 PM, Brion Vibber brion@pobox.com wrote:
During stub dump generation for instance this would need to add a left outer join on the other table, and add things to the dump output (and also needs an update to the XML schema for the dump format). This would then need to be preserved through subsequent dump passes as well.
Doesn't the stub dump generation computer have its own database? I still don't see the point of putting all this extra work on the master database in order to maintain a function-based index which is only being used for dumps.
The dump generation computer should have its own database. For most of the tables/dumps (probably all but the full-text ones), you could even use sqlite and offer that as a download option for those who aren't stuck in 1998.
On Tue, Sep 20, 2011 at 3:23 AM, Daniel Friesen lists@nadir-seen-fire.com wrote:
On 11-09-19 11:43 PM, Domas Mituzas wrote:
so, what are the use cases and how does one index for them? is it global hash check, per page? etc
One use case I know of is this bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=2939
Calculating and storing checksums on every revision in the database is way overkill for that.
On Tue, Sep 20, 2011 at 9:34 AM, Domas Mituzas midom.lists@gmail.com wrote:
Ah, okay. I remember that's what happened in MyISAM but I figured they had that fixed in InnoDB.
InnoDB has an optimized path for index builds, not for schema changes.
No support for built-in function-based indexes, right? (I searched a bit and couldn't find any.)
On 19/09/11 20:11, Greg Sabino Mullane wrote:
supported by many OSs, libraries, etc. As far as collisions, there are plenty of easy solutions, such as:
- Check for a collision before allowing a new revision, and do something if so (to handle the pre-image attack)
- When reverting, do a select count(*) where md5=? and then do something more advanced when more than one match is found
- Use the checksum to find the revision fast, but still do a full byte comparison.
A very simple thing that could be done is to use checksum, then size, and then full text.
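A small sketch of that cascade; the getSha1()/getSize()/getRawText() accessors stand in for whatever the revision object would actually expose, and each cheap comparison short-circuits the more expensive ones:

<?php
// Sketch: decide whether two revisions have identical text by comparing cheap
// fields first and only fetching full text as the final tie-breaker.
function revisionsHaveSameText( $revA, $revB ) {
    if ( $revA->getSha1() !== $revB->getSha1() ) {
        return false; // checksums differ: definitely different text
    }
    if ( $revA->getSize() !== $revB->getSize() ) {
        return false; // same hash but different length: a collision, not a match
    }
    // Same checksum and size: confirm with a full byte comparison.
    return $revA->getRawText() === $revB->getRawText();
}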
- When reverting, do a select count(*) where md5=? and then do something more advanced when more than one match is found
finally "we don't need an index on it" becomes "we need an index on it", and storage efficiency becomes much more interesting (binary packing yay ;-)
so, what are the use cases and how does one index for them? is it global hash check, per page? etc
Domas
On 11-09-19 11:43 PM, Domas Mituzas wrote:
- When reverting, do a select count(*) where md5=? and then do something more advanced when more than one match is found
finally "we don't need an index on it" becomes "we need an index on it", and storage efficiency becomes much more interesting (binary packing yay ;-)
so, what are the use cases and how does one index for them? is it global hash check, per page? etc
Domas
One use case I know of is this bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=2939
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
Domas Mituzas wrote:
- When reverting, do a select count(*) where md5=? and then do something more advanced when more than one match is found
finally "we don't need an index on it" becomes "we need an index on it", and storage efficiency becomes much more interesting (binary packing yay ;-)
so, what are the use cases and how does one index for them? is it global hash check, per page? etc
Domas
I don't know why people have started talking about checking hashes when reverting. MediaWiki already knows during a rollback that the revision is identical to the previous one; it even avoids storing the same text twice. However, if what you want is to "check whether an edit is a revert", say clearly that you are testing, on every edit, whether there is a previous revision [of the same page?] with the same text. Is there "something more advanced to be done", or just dreams? Would the WikiTrust extension, for instance, make use of it, or would it instead need to store its own checksums in another table?
PS: I don't consider bug 2939 a compelling need. I think it is good to see that there were messages. The provided use case could be solved with rollback in bot mode.
On 11-09-20 02:26 PM, Platonides wrote:
Domas Mituzas wrote:
- When reverting, do a select count(*) where md5=? and then do something more advanced when more than one match is found
finally "we don't need an index on it" becomes "we need an index on it", and storage efficiency becomes much more interesting (binary packing yay ;-)
so, what are the use cases and how does one index for them? is it global hash check, per page? etc
Domas
I don't know why people have started talking about checking hashes when reverting. MediaWiki already knows during a rollback that the revision is identical to the previous one; it even avoids storing the same text twice. However, if what you want is to "check whether an edit is a revert", say clearly that you are testing, on every edit, whether there is a previous revision [of the same page?] with the same text. Is there "something more advanced to be done", or just dreams? Would the WikiTrust extension, for instance, make use of it, or would it instead need to store its own checksums in another table?
PS: I don't consider bug 2939 a compelling need. I think it is good to see that there were messages. The provided use case could be solved with rollback in bot mode.
Care to re-open that bug then and suggest how to implement it? Because that bug was closed as LATER specifically because we don't have a checksum on the revision table.
Also, rollbacks are different from undos and manual reverts. On rollbacks we re-use the text field; however, on a normal undo or edit I don't believe we scan old revisions for matching content and re-use the text. Hence on any type of revert but a rollback, we don't know it's a revert without a checksum. And IIRC it was Brion who didn't like the idea of having to fetch and compare at least two text blobs whenever we need to do this.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
On Tue, Sep 20, 2011 at 3:37 PM, Happy Melon happy-melon@live.com wrote:
It may or may not be an architecturally-better design to have it as a separate table, although considering how rapidly MW's 'architecture' changes I'd say keeping things as simple as possible is probably a virtue. But that is the basis on which we should be deciding it.
It's an intentional denormalization of the database done apparently for performance reasons (although, I still can't figure out exactly *why* it's being done as it still seems to be useful only for the dump system, and therefore should be part of the dump system, not part of mediawiki proper). It doesn't even seem to apply to "normal", i.e. non-Wikimedia, installations.
On Tue, Sep 20, 2011 at 4:45 PM, Happy Melon happy.melon.wiki@gmail.com wrote:
This is a big project which still retains enthusiasm because we recognise that it has equally big potential to provide interesting new features far beyond the immediate usecases we can construct now (dump validation and 'something to do with reversions').
Can you explain how it's going to help with dump validation? It seems to me that further denormalizing the database is only going to *increase* these sorts of problems.
On Tue, Sep 20, 2011 at 5:36 PM, Anthony wikimail@inbox.org wrote:
On Tue, Sep 20, 2011 at 3:37 PM, Happy Melon happy-melon@live.com wrote:
It may or may not be an architecturally-better design to have it as a separate table, although considering how rapidly MW's 'architecture' changes I'd say keeping things as simple as possible is probably a virtue. But that is the basis on which we should be deciding it.
It's an intentional denormalization of the database done apparently for performance reasons (although, I still can't figure out exactly *why* it's being done as it still seems to be useful only for the dump system, and therefore should be part of the dump system, not part of mediawiki proper). It doesn't even seem to apply to "normal", i.e. non-Wikimedia, installations.
1) Those dumps are generated by MediaWiki from MediaWiki's database -- try Special:Export on the web UI, some API methods, and the dumpBackup.php maint script family.
2) Checksums would be of fairly obvious benefit to verifying text storage integrity within MediaWiki's own databases (though perhaps best sitting on or keyed to the text table...?) Default installs tend to use simple plain-text or gzipped storage, but big installs like Wikimedia's sites (and not necessarily just us!) optimize storage space by batch-compressing multiple text nodes into a local or remote blobs table.
On Tue, Sep 20, 2011 at 4:45 PM, Happy Melon happy.melon.wiki@gmail.com wrote:
This is a big project which still retains enthusiasm because we recognise that it has equally big potential to provide interesting new features far beyond the immediate usecases we can construct now (dump validation and 'something to do with reversions').
Can you explain how it's going to help with dump validation? It seems to me that further denormalizing the database is only going to *increase* these sorts of problems.
You'd be able to confirm that the text in an XML dump, or accessible through the wiki directly, matches what the database thinks it contains -- and that a given revision hasn't been corrupted by some funky series of accidents in XML dump recycling or External Storage recompression.
IMO that's about the only thing it's really useful for; detecting non-obviously-performed reversions seems like an edge case that's not worth optimizing for, since it would fail to handle lots of cases like reverting partial edits (say an "undo" of a section edit where there are other intermediary edits -- since the other parts of the page text are not identical, you won't get a match on the checksum).
-- brion
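A sketch of the dump-validation side of this. It assumes the dump format gains a per-revision <sha1> element holding a hex digest; both the element name and the encoding are assumptions about the eventual schema, and revisions with suppressed text would need special-casing:

<?php
// Sketch: walk a pages dump and flag revisions whose text no longer hashes to
// the checksum the dump carries for them.
$reader = new XMLReader();
$reader->open( 'pages-full.xml' );

$sha1 = $text = null;
while ( $reader->read() ) {
    if ( $reader->nodeType !== XMLReader::ELEMENT ) {
        continue;
    }
    if ( $reader->name === 'revision' ) {
        $sha1 = $text = null; // reset at the start of each revision
    } elseif ( $reader->name === 'text' ) {
        $text = $reader->readString();
    } elseif ( $reader->name === 'sha1' ) {
        $sha1 = $reader->readString();
    }
    if ( $sha1 !== null && $text !== null ) {
        if ( sha1( $text ) !== $sha1 ) {
            echo "Checksum mismatch in a revision -- possible corruption\n";
        }
        $sha1 = $text = null;
    }
}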
Thanks for the explanation. I guess I see what you're getting at now. Sorry I didn't see it sooner.
On Tue, Sep 20, 2011 at 8:50 PM, Brion Vibber brion@pobox.com wrote:
On Tue, Sep 20, 2011 at 5:36 PM, Anthony wikimail@inbox.org wrote:
On Tue, Sep 20, 2011 at 3:37 PM, Happy Melon happy-melon@live.com wrote:
It may or may not be an architecturally-better design to have it as a separate table, although considering how rapidly MW's 'architecture' changes I'd say keeping things as simple as possible is probably a virtue. But that is the basis on which we should be deciding it.
It's an intentional denormalization of the database done apparently for performance reasons (although, I still can't figure out exactly *why* it's being done as it still seems to be useful only for the dump system, and therefore should be part of the dump system, not part of mediawiki proper). It doesn't even seem to apply to "normal", i.e. non-Wikimedia, installations.
- Those dumps are generated by MediaWiki from MediaWiki's database -- try Special:Export on the web UI, some API methods, and the dumpBackup.php maint script family.
- Checksums would be of fairly obvious benefit to verifying text storage integrity within MediaWiki's own databases (though perhaps best sitting on or keyed to the text table...?) Default installs tend to use simple plain-text or gzipped storage, but big installs like Wikimedia's sites (and not necessarily just us!) optimize storage space by batch-compressing multiple text nodes into a local or remote blobs table.
On Tue, Sep 20, 2011 at 4:45 PM, Happy Melon happy.melon.wiki@gmail.com wrote:
This is a big project which still retains enthusiasm because we recognise that it has equally big potential to provide interesting new features far beyond the immediate usecases we can construct now (dump validation and 'something to do with reversions').
Can you explain how it's going to help with dump validation? It seems to me that further denormalizing the database is only going to *increase* these sorts of problems.
You'd be able to confirm that the text in an XML dump, or accessible through the wiki directly, matches what the database thinks it contains -- and that a given revision hasn't been corrupted by some funky series of accidents in XML dump recycling or External Storage recompression.
IMO that's about the only thing it's really useful for; detecting non-obviously-performed reversions seems like an edge case that's not worth optimizing for, since it would fail to handle lots of cases like reverting partial edits (say an "undo" of a section edit where there are other intermediary edits -- since the other parts of the page text are not identical, you won't get a match on the checksum).
-- brion
Brion Vibber wrote:
- Checksums would be of fairly obvious benefit to verifying text storage integrity within MediaWiki's own databases (though perhaps best sitting on or keyed to the text table...?) Default installs tend to use simple plain-text or gzipped storage, but big installs like Wikimedia's sites (and not necessarily just us!) optimize storage space by batch-compressing multiple text nodes into a local or remote blobs table.
I also considered the possibility of adding it on the text table. There it could go without a schema change, just as another field of the object. But it's too risky. We have already had corrupted objects in the past. And it wouldn't be as good a check as being in revision (e.g. suppose the text_ids were mismatched).
Some use cases:
* Dump validation (per Ariel)
* Revert detection
* Collapsing reversions in history to hide clutter
* Replacing/augmenting baseRevId hacks in FlaggedRevs
Domas Mituzas wrote:
- When reverting, do a select count(*) where md5=? and then do something more advanced when more than one match is found
finally "we don't need an index on it" becomes "we need an index on it", and storage efficiency becomes much more interesting (binary packing yay ;-)
so, what are the use cases and how does one index for them? is it global hash check, per page? etc
Domas