What I wonder is – does this *need* to be part of the database table, or
could it be a dataset generated from each revision and published
separately? That way, each user wouldn't have to compute the hashes
individually, and we would still get the (ostensible) benefit of moving
them out of the table.
On September 15, 2017 at 12:41:03 PM, Andrew Otto (otto(a)wikimedia.org)
wrote:
We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
from the little I know:
Most analytical computations (for things like reverts, as you say) don’t
have easy access to content, so computing SHAs on the fly is pretty hard.
MediaWiki history reconstruction relies on the SHA to figure out what
revisions revert other revisions, as there is no reliable way to know if
something is a revert other than by comparing SHAs.
See
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_his…
(particularly the *revert* fields).
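To make the SHA-based revert detection concrete, here is a rough sketch of the idea (illustrative only, not the actual Data Lake code; `find_reverts` and `sha1_base36` are made-up names). MediaWiki stores rev_sha1 as a base-36 encoded SHA-1 of the content, so a reproduction of that encoding is included:

```python
import hashlib

def sha1_base36(text: str) -> str:
    """Base-36 encode the SHA-1 of the given text, as rev_sha1 stores it."""
    digest = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    if digest == 0:
        return "0"
    chars = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = []
    while digest:
        digest, rem = divmod(digest, 36)
        out.append(chars[rem])
    return "".join(reversed(out))

def find_reverts(revisions):
    """Given (rev_id, sha1) pairs for one page in chronological order,
    flag each revision whose hash matches an earlier revision, i.e. an
    edit that restored a previous state of the page.
    Returns (reverting_rev_id, reverted_to_rev_id) pairs."""
    seen = {}
    reverts = []
    for rev_id, sha1 in revisions:
        if sha1 in seen:
            reverts.append((rev_id, seen[sha1]))
        else:
            seen[sha1] = rev_id
    return reverts
```

This is why the offline jobs want the hash in the row itself: with only metadata at hand, the hash is the one cheap signal that two revisions have identical content.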
On Fri, Sep 15, 2017 at 1:49 PM, Erik Zachte <ezachte(a)wikimedia.org> wrote:
Computing the hashes on the fly for the offline analysis doesn't work for
Wikistats 1.0, as it only parses the stub dumps, which contain just
metadata, without article content.
Parsing the full archive dumps is quite expensive, time-wise.
This may change with Wikistats 2.0, which has a totally different process
flow. That I can't tell.
Erik Zachte
-----Original Message-----
From: Wikitech-l [mailto:wikitech-l-bounces@lists.wikimedia.org] On
Behalf Of Daniel Kinzler
Sent: Friday, September 15, 2017 12:52
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
Hi all!
I'm working on the database schema for Multi-Content-Revisions (MCR) <
https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema>
and I'd like to get rid of the rev_sha1 field:
Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes
more expensive with MCR. With multiple content objects per revision, we
need to track the hash for each slot, and then re-calculate the sha1 for
each revision.
That's expensive especially in terms of bytes-per-database-row, which
impacts query performance.
So, what do we need the rev_sha1 field for? As far as I know, nothing in
core uses it, and I'm not aware of any extension using it either. It seems
to be used primarily in offline analysis for detecting (manual) reverts by
looking for revisions with the same hash.
Is that reason enough for dragging all the hashes around the database with
every revision update? Or can we just compute the hashes on the fly for
the offline analysis? Computing hashes is slow, since the content needs
to be loaded first, but it would only have to be done for pairs of
revisions of the same page with the same size, which should be a pretty
good optimization.
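The proposed optimization could be sketched like this (a hypothetical illustration, assuming revision length is available as metadata; `load_content` stands in for whatever mechanism fetches revision text): group revisions by byte length, and only hash the groups where a collision is even possible.

```python
import hashlib
from collections import defaultdict

def candidate_revert_groups(revisions):
    """revisions: (rev_id, rev_len) metadata rows for one page, in
    chronological order. Two revisions can only share a hash if they
    share a byte length, so group by length and keep the groups with
    more than one member."""
    by_len = defaultdict(list)
    for rev_id, rev_len in revisions:
        by_len[rev_len].append(rev_id)
    return [ids for ids in by_len.values() if len(ids) > 1]

def detect_reverts(revisions, load_content):
    """Hash content lazily: load_content(rev_id) -> text is called only
    for revisions inside a size-collision group, never for the page's
    full history."""
    reverts = []
    for group in candidate_revert_groups(revisions):
        seen = {}
        for rev_id in group:
            h = hashlib.sha1(load_content(rev_id).encode("utf-8")).hexdigest()
            if h in seen:
                reverts.append((rev_id, seen[h]))
            else:
                seen[h] = rev_id
    return reverts
```

Since most revisions of a page differ in length, the expensive content loads would be confined to a small fraction of rows.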
Also, I believe Roan is currently looking for a better mechanism for
tracking all kinds of reverts directly.
So, can we drop rev_sha1?
--
Daniel Kinzler
Principal Platform Engineer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l