Would it be possible to generate offline hashes for the bulk of our revision corpus via dumps and load that into prod to minimize the time and impact of the backfill?
When using the new columns for analysis, will we wish they had partial indexes (on the first 6 characters, say)?
Has code been written to populate rev_sha1 on each new edit?
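To make the first question concrete, here is a rough sketch of what offline hashing from a dump could look like. It is only an illustration under assumptions: it assumes a full-text XML dump with <revision> elements containing <id> and <text> children, a hex-encoded SHA-1 of the UTF-8 text, and a tab-separated (rev_id, hash) output for bulk loading. The function names, the tiny inline sample, and the output format are all hypothetical; none of this reflects existing code.

```python
import hashlib
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any XML namespace prefix, e.g. '{ns}revision' -> 'revision'."""
    return tag.rsplit("}", 1)[-1]


def revision_hashes(stream):
    """Yield (rev_id, sha1_hex) for every <revision> element in a dump stream.

    Assumption: the hash is a hex SHA-1 of the UTF-8 revision text. The
    encoding actually stored in rev_sha1 is not settled in this thread.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "revision":
            continue
        rev_id = None
        text = ""
        # Only direct children are inspected, so the <id> nested inside
        # <contributor> is not mistaken for the revision id.
        for child in elem:
            name = local(child.tag)
            if name == "id":
                rev_id = int(child.text)
            elif name == "text":
                text = child.text or ""
        yield rev_id, hashlib.sha1(text.encode("utf-8")).hexdigest()
        elem.clear()  # free the revision text once it has been hashed


# Hypothetical miniature dump, only to exercise the function.
SAMPLE = b"""<mediawiki><page><title>T</title><id>1</id>
<revision><id>10</id><contributor><username>u</username><id>7</id></contributor><text>hello</text></revision>
<revision><id>11</id><text>hello world</text></revision>
</page></mediawiki>"""

rows = list(revision_hashes(io.BytesIO(SAMPLE)))
for rev_id, digest in rows:
    print(f"{rev_id}\t{digest}")  # one line per revision, ready for a bulk load
```

The resulting file could then be loaded in batches, so that the production backfill would only need to cover revisions created after the dump was taken.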
On Thu, Aug 18, 2011 at 7:40 AM, Diederik van Liere <dvanliere@gmail.com> wrote:
Hi! I am starting this thread because Brion's revision r94541 reverted r94289 [0], stating "core schema change with no discussion" [1]. Bugs 21860 [2] and 25312 [3] advocate for the inclusion of a hash column (either md5 or sha1) in the revision table.
The primary use case of this column will be to assist in detecting reverts; I don't think that data integrity is the primary reason for adding it. The huge advantage of having such a column is that it will no longer be necessary to analyze full dumps to detect reverts. Instead, you can look for reverts in the stub dump file by looking for the same hash within a single page. The fact that there is a theoretical chance of a collision is not very important IMHO; it would just mean that in very rare cases in our research we would flag an edit as reverted when it was not.
The two bug reports contain quite long discussions, and this feature has also been discussed internally quite extensively, but oddly enough that discussion hasn't happened yet on the mailing list.
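As an illustration of the revert detection described above, here is a minimal sketch. It assumes each page's revisions are available in chronological order as (rev_id, hash) pairs, which is what a stub dump carrying the proposed column would provide. The function names and the sample data are made up for the example, and treating any repeated hash within a page as a revert is the simplification under discussion, not an established definition.

```python
import hashlib


def sha1_hex(text):
    """Hex SHA-1 of the UTF-8 text (the encoding is an assumption here)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_reverts(revisions):
    """Detect reverts within ONE page.

    revisions: list of (rev_id, digest) in chronological order.
    Returns a list of (reverting_rev_id, restored_rev_id, reverted_rev_ids).
    """
    last_seen = {}  # digest -> index of the most recent revision with it
    reverts = []
    for i, (rev_id, digest) in enumerate(revisions):
        if digest in last_seen:
            j = last_seen[digest]
            reverted = [r for r, _ in revisions[j + 1:i]]
            if reverted:  # consecutive identical revisions undo nothing
                reverts.append((rev_id, revisions[j][0], reverted))
        last_seen[digest] = i
    return reverts


# Hypothetical page history: the fourth revision restores the second.
texts = ["A", "A B", "vandalism", "A B"]
page = [(100 + n, sha1_hex(t)) for n, t in enumerate(texts)]
print(find_reverts(page))  # [(103, 101, [102])]
```

A hash collision would show up here as a false positive: two different texts with the same digest would be reported as a revert, which is the rare misclassification mentioned above.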
So let's have a discussion!
[0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94289
[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94541
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
[3] https://bugzilla.wikimedia.org/show_bug.cgi?id=25312
Best,
Diederik
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l