Ok, a little more detail here:
For MCR, we would have to keep around the hash of each content object ("slot")
AND of each revision. This makes the revision and content tables "wider", which
is a problem because they grow quite "tall", too. It also means we have to
compute a hash of hashes for each revision, but that's not horrible.
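To make the "hash of hashes" concrete, here is a rough sketch in Python. The names slot_hash and revision_hash, the role-to-content mapping, and the way the per-slot hashes are combined are all assumptions for illustration, not how MediaWiki actually does it:

```python
import hashlib
from typing import Mapping


def slot_hash(content: bytes) -> str:
    """SHA-1 of a single content object ("slot"), as a hex string."""
    return hashlib.sha1(content).hexdigest()


def revision_hash(slots: Mapping[str, bytes]) -> str:
    """Hash of hashes: combine the per-slot hashes into one revision hash.

    Slots are processed in sorted role order so that the result does not
    depend on the order in which the slots happen to be stored.
    (The combination scheme here is hypothetical.)
    """
    combined = hashlib.sha1()
    for role in sorted(slots):
        combined.update(role.encode("utf-8"))
        combined.update(b"\0")
        combined.update(slot_hash(slots[role]).encode("ascii"))
        combined.update(b"\n")
    return combined.hexdigest()
```

The extra computation per revision is one small hash over a handful of short strings, which is why it is not a big cost compared to hashing the content itself.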
I'm hoping we can remove the hash from both tables. Keeping the hash of each
content object and/or each revision somewhere else is fine with me. Perhaps it's
sufficient to generate it when generating XML dumps. Maybe we want it in Hadoop.
Maybe we want to have it in a separate SQL database. But perhaps we don't
actually need it.
Can someone explain *why* they want the hash at all?
On 15.09.2017 at 22:01, Stas Malyshev wrote:
Hi!
We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but from
the little I know:
Most analytical computations (for things like reverts, as you say) don’t
have easy access to content, so computing SHAs on the fly is pretty hard.
MediaWiki history reconstruction relies on the SHA to figure out what
revisions revert other revisions, as there is no reliable way to know if
something is a revert other than by comparing SHAs.
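As an illustration of the SHA comparison described above, a minimal sketch in Python. The function name find_reverts and the (rev_id, sha1) input format are assumptions; the real history reconstruction job may work quite differently:

```python
from typing import Dict, Iterable, List, Optional, Tuple


def find_reverts(history: Iterable[Tuple[int, str]]) -> List[Tuple[int, int]]:
    """Detect reverts in one page's history by comparing content hashes.

    history: (rev_id, sha1) pairs in chronological order.
    Returns (reverting_rev_id, restored_rev_id) pairs, where the restored
    revision is the earliest one that had the same hash.
    """
    first_seen: Dict[str, int] = {}
    reverts: List[Tuple[int, int]] = []
    prev_sha: Optional[str] = None
    for rev_id, sha in history:
        # Same hash as an earlier revision, but not the immediately
        # preceding one (that would be a null edit, not a revert).
        if sha in first_seen and sha != prev_sha:
            reverts.append((rev_id, first_seen[sha]))
        first_seen.setdefault(sha, rev_id)
        prev_sha = sha
    return reverts
```

Note that this only needs the hash column, not the content, which is the point made above about analytical jobs not having easy access to content.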
As a random idea - would it be possible to calculate the hashes when
data is transitioned from SQL to Hadoop storage? I imagine that would
slow down the transition, but not sure if it'd be substantial or not. If
we're using the hash just to compare revisions, we could also use a
different hash (maybe a non-cryptographic one?), which may be faster.
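A rough sketch of that idea in Python, computing a fingerprint per revision while rows are being exported, with the hash function pluggable. The names fingerprint_rows, sha1_hex and crc32_hex and the (rev_id, text) row format are assumptions for illustration only; CRC-32 stands in for "some non-crypto hash":

```python
import hashlib
import zlib
from typing import Callable, Iterable, Iterator, Tuple


def sha1_hex(data: bytes) -> str:
    """Cryptographic hash, 160 bits."""
    return hashlib.sha1(data).hexdigest()


def crc32_hex(data: bytes) -> str:
    """Non-cryptographic checksum, only 32 bits."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")


def fingerprint_rows(
    rows: Iterable[Tuple[int, str]],
    hash_fn: Callable[[bytes], str] = sha1_hex,
) -> Iterator[Tuple[int, str]]:
    """Yield (rev_id, fingerprint) for each exported (rev_id, text) row."""
    for rev_id, text in rows:
        yield rev_id, hash_fn(text.encode("utf-8"))
```

One caveat with a short checksum like CRC-32: with only 32 bits, two different revisions of a long-lived page are far more likely to collide than with SHA-1, so a faster hash with a wider output would probably be a better fit for revert detection.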
--
Daniel Kinzler
Principal Platform Engineer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.