Hi all!
I'm working on the database schema for Multi-Content-Revisions (MCR)
<https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema> and
I'd
like to get rid of the rev_sha1 field:
Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more
expensive with MCR. With multiple content objects per revision, we need to track
the hash for each slot, and then re-calculate the sha1 for each revision.
That's expensive especially in terms of bytes-per-database-row, which impacts
query performance.
So, what do we need the rev_sha1 field for? As far as I know, nothing in core
uses it, and I'm not aware of any extension using it either. It seems to be used
primarily in offline analysis for detecting (manual) reverts by looking for
revisions with the same hash.
Is that reason enough for dragging all the hashes around the database with every
revision update? Or can we just compute the hashes on the fly for the offline
analysis? Computing hashes is slow since the content needs to be loaded first,
but it would only have to be done for pairs of revisions of the same page with
the same size, which should be a pretty good optimization.
Also, I believe Roan is currently looking for a better mechanism for tracking
all kinds of reverts directly.
So, can we drop rev_sha1?
--
Daniel Kinzler
Principal Platform Engineer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.