Would it be possible to generate offline hashes for the bulk of our revision
corpus via dumps and load that into prod to minimize the time and impact of
When using them for analysis, will we wish the new columns had partial indexes
(on the first 6 characters, say)?
Is code written to populate rev_sha1 on each new edit?
On Thu, Aug 18, 2011 at 7:40 AM, Diederik van Liere <dvanliere(a)gmail.com> wrote:
I am starting this thread because Brion's revision r94289 reverted
r94289 stating "core schema change with no discussion".
Bugs 21860 and 25312 advocate for the inclusion of a hash
column (either md5 or sha1) in the revision table. The primary use
case of this column will be to assist detecting reverts. I don't think
that data integrity is the primary reason for adding this column. The
huge advantage of having such a column is that it will no longer be
necessary to analyze full dumps to detect reverts; instead you can
look for reverts in the stub dump file by looking for the same hash
within a single page. The fact that there is a theoretical chance of a
collision is not very important IMHO; it would just mean that in very
rare cases in our research we would flag an edit as reverted when
it was not. The two bug reports contain quite long discussions and this
feature has also been discussed internally quite extensively but oddly
enough that discussion hasn't happened yet on the mailing list.
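
To make the revert use case concrete, here is a minimal sketch of detecting
reverts by looking for a repeated hash within a single page. It assumes the
revisions of one page are available oldest first as (rev_id, hash) pairs, as
a stub dump with a hash column might provide. The names sha1_hex and
find_reverts are hypothetical, and the hex digest is only an illustrative
choice, not a statement about how the column would be encoded.

```python
import hashlib


def sha1_hex(text):
    """Return the SHA-1 hex digest of a revision's text (illustrative encoding)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_reverts(revisions):
    """Find identity reverts within a single page.

    revisions: iterable of (rev_id, digest) pairs, oldest first.
    Returns a list of (reverting_rev_id, restored_rev_id, reverted_rev_ids).
    """
    last_seen = {}  # digest -> position of the latest revision with that digest
    ordered = []    # rev_ids in the order they were seen
    reverts = []
    for position, (rev_id, digest) in enumerate(revisions):
        ordered.append(rev_id)
        if digest in last_seen:
            start = last_seen[digest]
            # Everything strictly between the two matching revisions was undone.
            reverted = ordered[start + 1:position]
            if reverted:  # skip consecutive identical revisions (null edits)
                reverts.append((rev_id, ordered[start], reverted))
        last_seen[digest] = position
    return reverts


# Hypothetical page history: revision 2 is undone by revision 3.
history = [
    (1, sha1_hex("Original text")),
    (2, sha1_hex("Original text plus vandalism")),
    (3, sha1_hex("Original text")),
]
reverts = find_reverts(history)
```

With this history, revision 3 is reported as restoring revision 1 and
reverting revision 2. A hash collision would show up here as a false
positive, which is the rare misflagging mentioned above.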
So let's have a discussion!
Wikitech-l mailing list