There are two important use cases: one where you want to identify previous
reverts, and one where you want to identify close matches. There are other
ways to do the first than using a digest, but a digest opens the door to
alternate client-side algorithms. The latter would typically be done with
some locality-sensitive hashing. In both cases you don't want to download
the content of each revision; that is exactly why you want some kind of
hash. If the hashes could be requested somehow, perhaps as part of the
API, that should be sufficient. The hashes could be part of the XML dump
too, but if you have the XML dump and know the algorithm, then you don't
need the digest.
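To sketch the first use case: assuming per-revision digests are available, identical-content reverts can be found by comparing digests without fetching content twice. The digest here is computed locally with SHA-1 as an illustration; MediaWiki's actual rev_sha1 field uses a base-36 encoding, so the encoding would have to match in practice.

```python
import hashlib


def revision_digest(content: str) -> str:
    """Hex-encoded SHA-1 of the revision text (illustrative; MediaWiki
    stores a base-36 SHA-1, so adjust the encoding when comparing)."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()


def find_reverts(revisions):
    """Given revisions in chronological order as (rev_id, content) pairs,
    flag a revision as a revert if its digest matches an earlier one.
    Returns (rev_id, restored_rev_id) pairs."""
    seen = {}      # digest -> rev_id of the first revision with that content
    reverts = []
    for rev_id, content in revisions:
        digest = revision_digest(content)
        if digest in seen:
            reverts.append((rev_id, seen[digest]))
        else:
            seen[digest] = rev_id
    return reverts
```

With digests delivered by the API, the same comparison works without ever downloading revision text; close-match detection would swap the exact digest for a locality-sensitive hash.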
There is a specific use case where someone wants to verify the content. In
that case you don't want to identify a previous revert; you want to check
whether someone has tampered with the downloaded content. As you don't know
who might have tampered with the content, you should also question the
digest delivered by WMF, so the digest in the database isn't good enough
as it is right now. Instead of a SHA digest, each revision should be
properly signed. But then, if you can't trust WMF, can you trust their
signature? Signatures for revisions should probably be delivered by some
external entity, not WMF itself.
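The verification step itself is simple once you have a digest from a channel you trust independently of the content download. A minimal sketch (the function name and the use of plain SHA-1 are illustrative; real tamper resistance would need signatures from the external entity discussed above):

```python
import hashlib


def verify_content(content: bytes, trusted_digest: str) -> bool:
    """Compare the SHA-1 of downloaded content against a digest obtained
    from an independent, trusted source. A match only rules out tampering
    to the extent that the digest's source can itself be trusted."""
    return hashlib.sha1(content).hexdigest() == trusted_digest
```

Note that if the digest comes from the same server as the content, anyone who can alter the content can alter the digest too, which is exactly the trust problem signatures are meant to solve.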
On Fri, Sep 15, 2017 at 11:44 PM, Daniel Kinzler <
daniel.kinzler(a)wikimedia.de> wrote:
A revert restores a previous revision. It covers all slots.

The fact that reverts, watching, protecting, etc. still work per page,
while you can have multiple kinds of different content on the page, is
indeed the point of MCR.
Am 15.09.2017 um 22:23 schrieb C. Scott Ananian:
Alternatively, perhaps "hash" could be an optional part of an MCR chunk?
We could keep it for the wikitext, but drop the hash for the metadata, and
drop any support for a "combined" hash over wikitext + all-other-pieces.

...which begs the question about how reverts work in MCR. Is it just the
wikitext which is reverted, or do categories and other metadata revert as
well? And perhaps we can just mark these at revert time instead of trying
to reconstruct it after the fact?
--scott
On Fri, Sep 15, 2017 at 4:13 PM, Stas Malyshev <smalyshev(a)wikimedia.org>
wrote:
Hi!
On 9/15/17 1:06 PM, Andrew Otto wrote:
> > As a random idea - would it be possible to calculate the hashes
> > when data is transitioned from SQL to Hadoop storage?
> We take monthly snapshots of the entire history, so every month we’d
> have to pull the content of every revision ever made :o
Why? If you've already seen that revision in a previous snapshot, you'd
already have its hash. Admittedly, I have no idea how the process works,
so I am just talking out of general knowledge and may miss some things.
Also, of course, you already have hashes for all revisions up to the day
we decide to turn the hash off. Starting that day, they'd have to be
generated, but I see no reason to generate a hash more than once.
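The incremental idea sketched above could look roughly like this (a sketch only; the function name and data shapes are assumptions, and the actual snapshot pipeline presumably differs):

```python
import hashlib


def update_hashes(previous, new_revisions):
    """previous: dict mapping rev_id -> digest carried over from the last
    monthly snapshot.
    new_revisions: iterable of (rev_id, content) pairs seen in this
    snapshot.
    Only revisions absent from the previous snapshot are hashed, so each
    revision's content needs to be pulled and hashed at most once ever."""
    hashes = dict(previous)
    for rev_id, content in new_revisions:
        if rev_id not in hashes:
            hashes[rev_id] = hashlib.sha1(
                content.encode("utf-8")).hexdigest()
    return hashes
```

Under this scheme, a monthly snapshot only fetches content for revisions created since the previous snapshot, rather than the entire history.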
--
Stas Malyshev
smalyshev(a)wikimedia.org
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
Daniel Kinzler
Principal Platform Engineer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.