Computing the hashes on the fly for the offline analysis doesn't work for Wikistats 1.0, as
it only parses the stub dumps, which contain just metadata, no article content.
Parsing the full archive dumps is quite expensive, time-wise.
This may change with Wikistats 2.0, which has a totally different process flow.
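For what it's worth, the stub dumps do carry the stored sha1 as per-revision metadata,
which is why Wikistats can consume it cheaply today. A minimal Python sketch of pulling
those values out (element names follow the MediaWiki XML export schema; treat the exact
layout as an assumption):

    import xml.etree.ElementTree as ET

    def revision_hashes(stub_dump_path):
        # Yield (rev_id, sha1) for every revision in a stub dump,
        # streaming so large files don't have to fit in memory.
        for _event, elem in ET.iterparse(stub_dump_path, events=("end",)):
            if elem.tag.rsplit("}", 1)[-1] == "revision":
                ns = elem.tag[: -len("revision")]  # "{namespace-uri}" prefix
                yield elem.findtext(ns + "id"), elem.findtext(ns + "sha1")
                elem.clear()  # release the processed subtree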
From: Wikitech-l [mailto:firstname.lastname@example.org] On Behalf Of Daniel
Sent: Friday, September 15, 2017 12:52
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
I'm working on the database schema for Multi-Content-Revisions (MCR)
<https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema> and I'd
like to get rid of the rev_sha1 field:
Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more expensive
with MCR. With multiple content objects per revision, we need to track the hash for each
slot, and then re-calculate the sha1 for each revision.
That's expensive, especially in terms of bytes-per-database-row, which impacts query
performance.
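To make the per-slot bookkeeping concrete, here is a rough Python sketch. The aggregation
scheme (sort slots by role name, hash the concatenation of their hashes) is an assumption
for illustration, not a settled MCR design:

    import hashlib

    def base36_sha1(data: bytes) -> str:
        # SHA-1 digest rendered in base 36 and zero-padded to 31
        # characters, the encoding MediaWiki uses for rev_sha1.
        n = int(hashlib.sha1(data).hexdigest(), 16)
        digits = "0123456789abcdefghijklmnopqrstuvwxyz"
        out = ""
        while n:
            n, r = divmod(n, 36)
            out = digits[r] + out
        return out.rjust(31, "0")

    def revision_sha1(slot_hashes: dict) -> str:
        # slot_hashes maps slot role name -> that slot's sha1.
        # A single-slot revision keeps its slot hash; otherwise we
        # hash the role-sorted concatenation (assumed scheme).
        if len(slot_hashes) == 1:
            return next(iter(slot_hashes.values()))
        joined = "".join(slot_hashes[role] for role in sorted(slot_hashes))
        return base36_sha1(joined.encode("ascii"))

Every slot edit means re-running this aggregation and rewriting the revision row, which is
where the extra cost comes from.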
So, what do we need the rev_sha1 field for? As far as I know, nothing in core uses it, and
I'm not aware of any extension using it either. It seems to be used primarily in
offline analysis for detecting (manual) reverts by looking for revisions with the same
hash.
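With the hashes already stored, the detection itself is trivial; a sketch, assuming
(rev_id, sha1) pairs in chronological order for one page:

    def find_reverts(revisions):
        # A revision whose hash matches an earlier revision of the
        # same page restored that earlier content, i.e. a likely
        # manual revert.
        seen = {}       # sha1 -> first rev_id that produced it
        reverts = []
        for rev_id, sha1 in revisions:
            if sha1 in seen:
                reverts.append((rev_id, seen[sha1]))  # (revert, reverted-to)
            else:
                seen[sha1] = rev_id
        return reverts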
Is that reason enough for dragging all the hashes around the database with every revision
update? Or can we just compute the hashes on the fly for the offline analysis? Computing
hashes is slow since the content needs to be loaded first, but it would only have to be
done for pairs of revisions of the same page with the same size, which should be a pretty
small set.
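The stored revision size acts as a cheap prefilter here: identical content implies
identical size, so only same-size revisions of a page ever need their content loaded. A
sketch of that idea (load_content is a hypothetical helper standing in for whatever
fetches revision text):

    import hashlib
    from collections import defaultdict

    def confirm_reverts(revisions, load_content):
        # revisions: iterable of (rev_id, size) for one page, in
        # chronological order. Only same-size revisions can match,
        # so content is loaded and hashed just for those groups.
        by_size = defaultdict(list)
        for rev_id, size in revisions:
            by_size[size].append(rev_id)
        reverts = []
        for ids in by_size.values():
            if len(ids) < 2:
                continue
            seen = {}
            for rev_id in ids:
                digest = hashlib.sha1(load_content(rev_id)).digest()
                if digest in seen:
                    reverts.append((rev_id, seen[digest]))
                else:
                    seen[digest] = rev_id
        return reverts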
Also, I believe Roan is currently looking for a better mechanism for tracking all kinds of
reverts.
So, can we drop rev_sha1?
Principal Platform Engineer
Gesellschaft zur Förderung Freien Wissens e.V.