Hi all!
I'm working on the database schema for Multi-Content-Revisions (MCR) <https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema> and I'd like to get rid of the rev_sha1 field:
Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more expensive with MCR. With multiple content objects per revision, we need to track the hash for each slot, and then re-calculate the sha1 for each revision.
The cost is mainly in bytes per database row, which hurts query performance.
So, what do we need the rev_sha1 field for? As far as I know, nothing in core uses it, and I'm not aware of any extension using it either. It seems to be used primarily in offline analysis for detecting (manual) reverts by looking for revisions with the same hash.
Is that reason enough for dragging all the hashes around the database with every revision update? Or can we just compute the hashes on the fly for the offline analysis? Computing hashes is slow since the content needs to be loaded first, but it would only have to be done for pairs of revisions of the same page with the same size, which should be a pretty good optimization.
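To make the on-the-fly idea a bit more concrete, here is a rough sketch in Python. It is only an illustration: the names Rev, find_identical_revisions and load_content are made up for this example and are not MediaWiki APIs, and I'm assuming the analysis already has (rev_id, page_id, size) tuples available cheaply, e.g. from a dump. Content is only loaded and hashed for revisions that share both page and size with at least one other revision.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Iterable, List, NamedTuple


class Rev(NamedTuple):
    """Hypothetical minimal revision record for offline analysis."""
    rev_id: int
    page_id: int
    size: int  # content length in bytes, known without loading the content


def find_identical_revisions(
    revs: Iterable[Rev],
    load_content: Callable[[int], bytes],  # hypothetical loader: rev_id -> raw content
) -> List[List[int]]:
    """Return groups of rev_ids on the same page with byte-identical content."""
    # Step 1: bucket by (page, size). No content is loaded here.
    buckets = defaultdict(list)
    for rev in revs:
        buckets[(rev.page_id, rev.size)].append(rev.rev_id)

    groups = []
    for rev_ids in buckets.values():
        if len(rev_ids) < 2:
            continue  # size is unique on this page, so nothing can match

        # Step 2: load and hash only the candidates that share a size.
        by_hash = defaultdict(list)
        for rev_id in rev_ids:
            digest = hashlib.sha1(load_content(rev_id)).hexdigest()
            by_hash[digest].append(rev_id)

        groups.extend(sorted(ids) for ids in by_hash.values() if len(ids) > 1)

    return sorted(groups)
```

How much this saves depends on how often unrelated revisions of a page happen to have the same size, which I haven't measured.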
Also, I believe Roan is currently looking for a better mechanism for tracking all kinds of reverts directly.
So, can we drop rev_sha1?