There are two important use cases: one where you want to identify previous reverts, and one where you want to identify close matches. There are other ways to do the first than using a digest, but a digest also opens up alternative client-side algorithms. The second would typically be done with some kind of locality-sensitive hashing. In both cases you don't want to download the content of each revision; that is exactly why you want some kind of hash. If the hashes could be requested somehow, perhaps as part of the API, that should be sufficient. The hashes could be part of the XML dump too, but if you have the XML dump and know the algorithm, then you don't need the digest.
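For the revert-detection case, a minimal sketch of a client that never downloads revision content, only the digests (assuming the API keeps exposing rvprop=sha1; the page title and limit below are just placeholders):

import requests

API = "https://en.wikipedia.org/w/api.php"

def find_reverts(title, limit=50):
    # Flag revisions whose SHA-1 matches an earlier revision of the same
    # page (likely reverts) without fetching any revision content.
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|sha1",
        "rvlimit": limit,
        "format": "json",
    }
    page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
    seen = {}
    reverts = []
    for rev in reversed(page["revisions"]):  # oldest first
        sha1 = rev.get("sha1")  # may be absent for suppressed revisions
        if sha1 in seen:
            reverts.append((rev["revid"], seen[sha1]))  # (reverting rev, restored rev)
        elif sha1:
            seen[sha1] = rev["revid"]
    return reverts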
There is a specific use case where someone wants to verify the content. In that case you don't want to identify a previous revert; you want to check whether someone has tampered with the downloaded content. Since you don't know who might have tampered with it, you should also question the digest delivered by WMF, so the digest in the database isn't good enough as it stands. Instead of a SHA digest, each revision should be properly signed; but then, if you can't trust WMF, can you trust their signature? Signatures for revisions should probably be delivered by some external entity, not WMF itself.
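For the verification case, recomputing the digest locally at least catches corruption in transit, even though it cannot protect against whoever controls both the content and the digest. A minimal sketch, assuming the API-reported value is the hex-encoded SHA-1 of the UTF-8 wikitext (the rev_sha1 column in the database is base-36 encoded, if I recall correctly, so adjust accordingly if you compare against that instead):

import hashlib

def verify_revision(wikitext, reported_sha1):
    # Recompute the SHA-1 of the downloaded wikitext and compare it with the
    # digest reported by the API. This detects accidental corruption only; it
    # is worthless against a party that can alter both content and digest.
    local = hashlib.sha1(wikitext.encode("utf-8")).hexdigest()
    return local == reported_sha1.lower()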
On Fri, Sep 15, 2017 at 11:44 PM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
A revert restores a previous revision. It covers all slots.
The fact that reverts, watching, protecting, etc still works per page, while you can have multiple kinds of different content on the page, is indeed the point of MCR.
On 15.09.2017 at 22:23, C. Scott Ananian wrote:
Alternatively, perhaps "hash" could be an optional part of an MCR chunk? We could keep it for the wikitext, but drop the hash for the metadata, and drop any support for a "combined" hash over wikitext + all other pieces.
...which begs the question about how reverts work in MCR. Is it just the wikitext which is reverted, or do categories and other metadata revert as well? And perhaps we can just mark these at revert time instead of trying to reconstruct it after the fact? --scott
On Fri, Sep 15, 2017 at 4:13 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
On 9/15/17 1:06 PM, Andrew Otto wrote:
As a random idea - would it be possible to calculate the hashes when data is transitioned from SQL to Hadoop storage?
We take monthly snapshots of the entire history, so every month we’d have to pull the content of every revision ever made :o
Why? If you've already seen that revision in a previous snapshot, you'd already have its hash? Admittedly, I have no idea how the process works, so I am just talking from general knowledge and may be missing some things. Also, of course, you already have hashes for all revisions up to the day we decide to turn the hash off. Starting that day, it'd have to be generated, but I see no reason to generate it more than once?
-- Stas Malyshev smalyshev@wikimedia.org
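A sketch of that incremental approach (names here are hypothetical): carry the hashes forward from the previous snapshot and only pull content for revisions created since then.

import hashlib

def update_hashes(previous_hashes, new_revisions):
    # previous_hashes: rev_id -> sha1 carried over from the last snapshot.
    # new_revisions: iterable of (rev_id, wikitext) for revisions added since then.
    # Only the new revisions' content needs to be fetched; the rest is reused.
    hashes = dict(previous_hashes)
    for rev_id, wikitext in new_revisions:
        if rev_id not in hashes:
            hashes[rev_id] = hashlib.sha1(wikitext.encode("utf-8")).hexdigest()
    return hashes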
-- Daniel Kinzler Principal Platform Engineer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.