Can it be a dataset generated from each revision and then published separately?
Perhaps it could be generated asynchronously via a job, either stored with the revision or in a separate table?
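(A rough sketch of what such a job could look like, just to illustrate the idea — the db helpers and the separate hash table are hypothetical, and the base-36 SHA-1 is, as far as I know, the format MediaWiki already uses for rev_sha1:)

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def base36_sha1(text: str) -> str:
    """SHA-1 of the revision text, rendered in base 36 (rev_sha1-style)."""
    n = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def backfill_hashes(db, batch_size=1000):
    """One pass of an async backfill job: hash revisions that have no stored
    hash yet and write the result to a separate table.
    `db.fetch_revisions_missing_hash` and `db.insert_revision_hash` are
    hypothetical helpers standing in for real storage access."""
    rows = db.fetch_revisions_missing_hash(limit=batch_size)
    for rev_id, text in rows:
        db.insert_revision_hash(rev_id, base36_sha1(text))
    return len(rows)
```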
On Fri, Sep 15, 2017 at 4:06 PM, Andrew Otto otto@wikimedia.org wrote:
As a random idea - would it be possible to calculate the hashes when data is transitioned from SQL to Hadoop storage?
We take monthly snapshots of the entire history, so every month we’d have to pull the content of every revision ever made :o
On Fri, Sep 15, 2017 at 4:01 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but from the little I know:
Most analytical computations (for things like reverts, as you say) don't have easy access to content, so computing SHAs on the fly is pretty hard.
MediaWiki history reconstruction relies on the SHA to figure out what revisions revert other revisions, as there is no reliable way to know if something is a revert other than by comparing SHAs.
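(To make the dependence concrete, here's a minimal sketch of the identity-revert idea, assuming we only have each revision's stored rev_sha1 and not its content — not the actual reconstruction code:)

```python
def find_identity_reverts(revisions):
    """revisions: iterable of (rev_id, rev_sha1) ordered by timestamp for a
    single page. Returns (reverting_rev, reverted_to_rev) pairs: a revision
    counts as an identity revert if its hash matches an earlier revision."""
    first_seen = {}   # sha1 -> earliest rev_id with that content
    reverts = []
    for rev_id, sha1 in revisions:
        if sha1 in first_seen:
            reverts.append((rev_id, first_seen[sha1]))
        else:
            first_seen[sha1] = rev_id
    return reverts

# Example: revision 103 restores the content state of revision 101.
history = [(101, "abc"), (102, "def"), (103, "abc")]
assert find_identity_reverts(history) == [(103, 101)]
```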
As a random idea - would it be possible to calculate the hashes when data is transitioned from SQL to Hadoop storage? I imagine that would slow down the transition, but I'm not sure whether the slowdown would be substantial. If we're using the hash just to compare revisions, we could also use a different hash (maybe a non-crypto hash?), which might be faster.
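(Sketched out, the idea might look something like this — the (rev_id, text) row shape is an assumption about the extract step, and crc32 is just one example of a cheaper non-crypto hash:)

```python
import hashlib
import zlib

def sha1_hash(text: str) -> str:
    """SHA-1 hex digest, for compatibility with how rev_sha1 is compared."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def crc32_hash(text: str) -> str:
    """Fast non-cryptographic hash; higher collision risk than SHA-1, so only
    suitable if occasional false matches are acceptable or double-checked."""
    return format(zlib.crc32(text.encode("utf-8")), "08x")

def add_hashes(rows, hash_fn=sha1_hash):
    """rows: iterable of (rev_id, text) streaming from SQL toward Hadoop.
    Yields (rev_id, text, content_hash) so the hash is computed once, during
    the transition, rather than on the fly at analysis time."""
    for rev_id, text in rows:
        yield rev_id, text, hash_fn(text)
```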
-- Stas Malyshev smalyshev@wikimedia.org