Hi all!
I'm working on the database schema for Multi-Content-Revisions (MCR) https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema and I'd like to get rid of the rev_sha1 field:
Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more expensive with MCR. With multiple content objects per revision, we need to track the hash for each slot, and then re-calculate the sha1 for each revision.
That's expensive, especially in terms of bytes per database row, which impacts query performance.
So, what do we need the rev_sha1 field for? As far as I know, nothing in core uses it, and I'm not aware of any extension using it either. It seems to be used primarily in offline analysis for detecting (manual) reverts by looking for revisions with the same hash.
Is that reason enough for dragging all the hashes around the database with every revision update? Or can we just compute the hashes on the fly for the offline analysis? Computing hashes is slow since the content needs to be loaded first, but it would only have to be done for pairs of revisions of the same page with the same size, which should be a pretty good optimization.
Also, I believe Roan is currently looking for a better mechanism for tracking all kinds of reverts directly.
So, can we drop rev_sha1?
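For illustration, a minimal sketch of that optimization, assuming a hypothetical load_content() helper and using the rev_page / rev_len / rev_id names from the revision table:

```python
# Hypothetical sketch of the "same page, same size" optimization described
# above. rev_id / rev_page / rev_len mirror MediaWiki's revision table;
# load_content(rev_id) -> bytes is an assumed helper, not an existing API.
import hashlib
from collections import defaultdict

def candidate_groups(revisions):
    """Group revision ids by (page, byte length); only these need hashing."""
    by_key = defaultdict(list)
    for rev in revisions:
        by_key[(rev["rev_page"], rev["rev_len"])].append(rev["rev_id"])
    return {key: ids for key, ids in by_key.items() if len(ids) > 1}

def confirm_identical(rev_ids, load_content):
    """Hash only the candidate revisions and return groups with equal content."""
    by_hash = defaultdict(list)
    for rev_id in rev_ids:
        by_hash[hashlib.sha1(load_content(rev_id)).hexdigest()].append(rev_id)
    return [ids for ids in by_hash.values() if len(ids) > 1]
```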
Computing the hashes on the fly for the offline analysis doesn't work for Wikistats 1.0, as it only parses the stub dumps, without article content, just metadata. Parsing the full archive dumps is quite expensive, time-wise.
This may change with Wikistats 2.0, which has a totally different process flow. That I can't tell.
Erik Zachte
We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but from the little I know:
Most analytical computations (for things like reverts, as you say) don’t have easy access to content, so computing SHAs on the fly is pretty hard. MediaWiki history reconstruction relies on the SHA to figure out what revisions revert other revisions, as there is no reliable way to know if something is a revert other than by comparing SHAs.
See https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_hist... (particularly the *revert* fields).
What I wonder is – does this *need* to be a part of the database table, or can it be a dataset generated from each revision and then published separately? This way each user wouldn’t have to individually compute the hashes while we also get the (ostensible) benefit of getting them out of the table.
We could keep it in the XML dumps (it's part of the XSD after all)...just compute it at export time. Not terribly hard, I don't think, we should have the parsed content already on hand....
-Chad
Hi!
We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but from the little I know:
Most analytical computations (for things like reverts, as you say) don’t have easy access to content, so computing SHAs on the fly is pretty hard. MediaWiki history reconstruction relies on the SHA to figure out what revisions revert other revisions, as there is no reliable way to know if something is a revert other than by comparing SHAs.
As a random idea - would it be possible to calculate the hashes when data is transitioned from SQL to Hadoop storage? I imagine that would slow down the transition, but not sure if it'd be substantial or not. If we're using the hash just to compare revisions, we could also use different hash (maybe non-crypto hash?) which may be faster.
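For a rough sense of the speed difference, a small timing sketch (standard-library hashes only; a realistic non-crypto candidate such as xxHash would be a third-party package):

```python
# Rough timing sketch for the "cheaper hash" idea: SHA-1 vs. two standard
# library alternatives on a synthetic ~100 KB revision. crc32 is shown only
# for the speed comparison; 32 bits is far too collision-prone to identify
# revisions, so a real candidate would be a 64/128-bit non-crypto hash.
import hashlib
import timeit
import zlib

payload = b"Lorem ipsum dolor sit amet. " * 4000  # ~112 KB of fake wikitext

def bench(label, fn, number=200):
    seconds = timeit.timeit(lambda: fn(payload), number=number)
    print(f"{label:6s} {seconds / number * 1e6:8.1f} us per call")

bench("sha1", lambda data: hashlib.sha1(data).digest())
bench("md5", lambda data: hashlib.md5(data).digest())
bench("crc32", lambda data: zlib.crc32(data))
```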
As a random idea - would it be possible to calculate the hashes when data is transitioned from SQL to Hadoop storage?
We take monthly snapshots of the entire history, so every month we’d have to pull the content of every revision ever made :o
can it be a dataset generated from each revision and then published separately?
Perhaps it could be generated asynchronously via a job? Either stored in revision or in a separate table.
Hi!
On 9/15/17 1:06 PM, Andrew Otto wrote:
As a random idea - would it be possible to calculate the hashes when data is transitioned from SQL to Hadoop storage?
We take monthly snapshots of the entire history, so every month we’d have to pull the content of every revision ever made :o
Why? If you have already seen that revision in a previous snapshot, you'd already have its hash. Admittedly, I have no idea how the process works, so I am just talking from general knowledge and may be missing some things. Also, of course, you already have hashes for all revisions up to the day we decide to turn the hash off. Starting that day, hashes would have to be generated, but I see no reason to generate one more than once?
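A minimal sketch of that incremental approach, assuming a hypothetical snapshot format and content loader:

```python
# Sketch of the incremental idea: carry hashes forward from the previous
# monthly snapshot and hash only revisions that are new since then. The
# snapshot format and load_content(rev_id) -> bytes helper are assumptions.
import hashlib

def update_hashes(previous_hashes, current_rev_ids, load_content):
    """previous_hashes: dict rev_id -> sha1 hex digest from the last snapshot."""
    hashes = {}
    for rev_id in current_rev_ids:
        if rev_id in previous_hashes:        # already hashed in an earlier run
            hashes[rev_id] = previous_hashes[rev_id]
        else:                                # new revision: load and hash once
            hashes[rev_id] = hashlib.sha1(load_content(rev_id)).hexdigest()
    return hashes
```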
Alternatively, perhaps "hash" could be an optional part of an MCR chunk? We could keep it for the wikitext, but drop the hash for the metadata, and drop any support for a "combined" hash over wikitext + all-other-pieces.
...which begs the question about how reverts work in MCR. Is it just the wikitext which is reverted, or do categories and other metadata revert as well? And perhaps we can just mark these at revert time instead of trying to reconstruct it after the fact? --scott
A revert restores a previous revision. It covers all slots.
The fact that reverts, watching, protecting, etc. still work per page, while you can have multiple kinds of different content on the page, is indeed the point of MCR.
There are two important use cases: one where you want to identify previous reverts, and one where you want to identify close matches. There are other ways to do the first than using a digest, but the digest opens up alternate client-side algorithms. The second would typically be done with some kind of locality-sensitive hashing. In both cases you don't want to download the content of each revision; that is exactly why you want some kind of hash. If the hashes could be requested somehow, perhaps as part of the API, that should be sufficient. The hashes could be part of the XML dump too, but if you have the full XML dump and know the algorithm, then you don't need the digest.
There is also a specific use case where someone wants to verify the content. In that case you don't want to identify a previous revert; you want to check whether someone has tampered with the downloaded content. Since you don't know who might have tampered with the content, you should also question the digest delivered by WMF, so the digest in the database isn't good enough as it is right now. Instead of a SHA digest, each revision should be properly signed; but then, if you can't trust WMF, can you trust their signature? Signatures for revisions should probably be delivered by some external entity and not by WMF itself.
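For the close-match case, a generic locality-sensitive hashing sketch (MinHash over word shingles); this illustrates the technique only and is not something MediaWiki or the dumps implement:

```python
# Generic locality-sensitive hashing sketch for the "close match" case:
# a MinHash signature over word 3-grams, so near-identical revisions agree on
# most signature slots even when an exact SHA-1 comparison fails.
import hashlib

def shingles(text, n=3):
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(text, num_perm=64):
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```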
Ok, a little more detail here:
For MCR, we would have to keep around the hash of each content object ("slot") AND of each revision. This makes the revision and content tables "wider", which is a problem because they grow quite "tall", too. It also means we have to compute a hash of hashes for each revision, but that's not horrible.
I'm hoping we can remove the hash from both tables. Keeping the hash of each content object and/or each revision somewhere else is fine with me. Perhaps it's sufficient to generate it when generating XML dumps. Maybe we want it in hadoop. Maybe we want to have it in a separate SQL database. But perhaps we don't actually need it.
Can someone explain *why* they want the hash at all?
Am 15.09.2017 um 19:49 schrieb Erik Zachte:
Computing the hashes on the fly for the offline analysis doesn't work for Wikistats 1.0, as it only parses the stub dumps, without article content, just metadata. Parsing the full archive dumps is quite expensive, time-wise.
We can always compute the hash when outputting XML dumps that contain the full content (it's already loaded, so no big deal), and then generate the XML dump with only meta-data from the full dump.
On 09/15/2017 06:51 AM, Daniel Kinzler wrote:
Also, I believe Roan is currently looking for a better mechanism for tracking all kinds of reverts directly.
Let's see if we want to use rev_sha1 for that better solution (a way to track reverts within MW itself) before we drop it.
I know Roan is planning to write an RFC on reverts.
Matt
At a quick glance, EventBus and FlaggedRevs are the two extensions using the hashes. EventBus just puts them into the emitted data; FlaggedRevs detects reverts to the latest stable revision that way (so there is no rev_sha1-based lookup in either case, although in the case of FlaggedRevs I could imagine a use case for something like that).
Files on the other hand use hash lookups a lot, and AIUI they are planned to become MCR slots eventually.
For a quick win, you could just reduce the hash size. We have around a billion revisions, and probably won't ever have more than a trillion; square that for birthday effect and add a couple extra zeros just to be sure, and it still fits comfortably into 80 bits. If hashes only need to be unique within the same page then maybe 30-40.
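A back-of-the-envelope sketch of that truncation idea (illustrative only):

```python
# Back-of-the-envelope version of the truncation idea: keep only the first
# 80 bits (10 bytes) of the SHA-1 digest. Purely illustrative; MediaWiki
# stores the full 160-bit hash today.
import hashlib

def short_hash(content: bytes, bits: int = 80) -> str:
    return hashlib.sha1(content).digest()[: bits // 8].hex()

print(short_hash(b"example revision text"))  # 20 hex characters instead of 40
```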
Am 16.09.2017 um 01:22 schrieb Matthew Flaschen:
On 09/15/2017 06:51 AM, Daniel Kinzler wrote:
Also, I believe Roan is currently looking for a better mechanism for tracking all kinds of reverts directly.
Let's see if we want to use rev_sha1 for that better solution (a way to track reverts within MW itself) before we drop it.
The problem is that if we don't drop it, we have to *introduce* it for the new content table for MCR. I'd like to avoid that.
I guess we can define the field and just null it, but... well. I'd like to avoid that.
So, as things stand, rev_sha1 in the database is used for:
1. the XML dumps process and all the researchers depending on the XML dumps (probably just for revert detection)
2. revert detection for libraries like python-mwreverts [1]
3. revert detection in mediawiki history reconstruction processes in Hadoop (Wikistats 2.0)
4. revert detection in Wikistats 1.0
5. revert detection for tools that run on labs, like Wikimetrics
?. I think Aaron also uses rev_sha1 in ORES, but I can't seem to find the latest code for that service
If you think about this list above as a flow of data, you'll see that rev_sha1 is replicated to xml, labs databases, hadoop, ML models, etc. So removing it and adding it back downstream from the main mediawiki database somewhere, like in XML, cuts off the other places that need it. That means it must be available either in the mediawiki database or in some other central database which all those other consumers can pull from.
I defer to your expertise when you say it's expensive to keep in the db, and I can see how that would get much worse with MCR. I'm sure we can figure something out, though. Right now it seems like our options are, as others have pointed out:
* compute async and store in DB or somewhere else that's central and easy to access from all the branches I mentioned
* update how we detect reverts and keep a revert database with good references to wiki_db, rev_id so it can be brought back in context.
Personally, I would love to get better revert detection, using sha1 exact matches doesn't really get to the heart of the issue. Important phenomena like revert wars, bullying, and stalking are hiding behind bad revert detection. I'm happy to brainstorm ways we can use Analytics infrastructure to do this. We definitely have the tools necessary, but not so much the man-power. That said, please don't strip out rev_sha1 until we've accounted for all its "data customers".
So, put another way, I think it's totally fine if we say ok everyone, from date XYZ, you will no longer have rev_sha1 in the database, but if you want to know whether an edit reverts a previous edit or a series of edits, go *HERE*. That's fine. And just for context, here's how we do our revert detection in Hadoop (it's pretty fancy) [2].
[1] https://github.com/mediawiki-utilities/python-mwreverts
[2] https://github.com/wikimedia/analytics-refinery-source/blob/1d38b8e4acfd10dc...
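For readers who don't want to dig into the linked code, the core of SHA-1 identity-revert detection looks roughly like this (a simplified sketch, not the actual refinery or python-mwreverts implementation):

```python
# Simplified take on SHA-1 identity-revert detection: within one page's
# chronological history, a revision is flagged as a revert when its hash
# matches an earlier revision inside a limited look-back window.
def detect_reverts(revisions, radius=15):
    """revisions: list of (rev_id, sha1) tuples, oldest first, for one page."""
    reverts = []
    for i, (rev_id, sha1) in enumerate(revisions):
        for earlier_id, earlier_sha1 in reversed(revisions[max(0, i - radius):i]):
            if earlier_sha1 == sha1:
                reverts.append((rev_id, earlier_id))  # rev_id restores earlier_id
                break
    return reverts
```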
I use rev_sha1 on the replicas to check the consistency of modules, templates or other pages (typically help pages) which should be the same between projects (either within one language, or even cross-language if the page is not language-dependent). In other words, to detect changes in them and keep them in sync.
Also, I haven't noticed it mentioned in the thread: Flow also notifies users on reverts, but I don't know whether it uses rev_sha1 or not, so I'm mentioning it just in case.
Kind regards
Danny B.
I am not a MediaWiki developer, but shouldn't sha1 be moved, rather than deleted or kept where it is? Moved to the content table, so it is kept unaltered.
That way it can be used for all the goals that have been discussed (detecting reverts, XML dumps, etc.): the hashes are not altered, just moved (which is more compatible). It is not as if structural compatibility is going to be kept anyway, since many fields are going to be "moved" there, so code using the tables directly has to change regardless; but if the actual content is not altered, the sha field can keep the same values as before. It would also allow detecting a "partial revert", meaning the wikitext is set back to the same value as a previous revision, which I assume is what it is mostly used for now. However, with MCR there will be other content that can be reverted individually.
I do not know exactly what MCR is going to be used for, but if (silly idea) the main article text and the categories are two different content objects of an article, and user A edits both while user B reverts the text only, that would produce a different revision sha1 value; yet most use cases mentioned here would want to detect the revert by checking the sha of the text content alone. Equally, for backwards compatibility, storing it on content would avoid having to recalculate it for all existing values, literally reducing this to a "trivial" code change while keeping all old data valid. Keeping the field as is, on revision, would mean all historical data and old dumps become invalid. Full revision reverts, if needed, can be checked by comparing each individual content sha or the linked content ids.
If, on the other hand, revision has to be kept completely backwards compatible, some helper views can be created on the cloud wikireplicas; but beyond that, MCR would not be possible.
If, at a later time, text with the same hash is detected (and the content double-checked), content could be normalized by assigning the same id to the same content?
Am 19.09.2017 um 10:15 schrieb Jaime Crespo:
I am not a MediaWiki developer, but shouldn't sha1 be moved, rather than deleted or kept where it is? Moved to the content table, so it is kept unaltered.
The background of my original mail is indeed the question of whether we need the sha1 field in the content table. The current draft of the DB schema includes it.
That table will be tall, and the sha1 is the (on average) largest field. If we are going to use a different mechanism for tracking reverts soon, my hope was that we can do without it.
In any case, my impression is that if we want to keep using hashes to detect reverts, we need to keep rev_sha1 - and to maintain it, we ALSO need content_sha1.
On Tue, Sep 19, 2017 at 6:42 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
That table will be tall, and the sha1 is the (on average) largest field. If we are going to use a different mechanism for tracking reverts soon, my hope was that we can do without it.
Can't you just split it into a separate table? Core would only need to touch it on insert/update, so that should resolve the performance concerns.
Also, since content is supposed to be deduplicated (so two revisions with the exact same content will have the same content_address), cannot that replace content_sha1 for revert detection purposes? That wouldn't work over large periods of time (when the original revision and the revert live in different kinds of stores) but maybe that's an acceptable compromise.
Am 19.09.2017 um 20:48 schrieb Gergo Tisza:
Can't you just split it into a separate table? Core would only need to touch it on insert/update, so that should resolve the performance concerns.
Yes, we could put it into a separate table. But that table would be exactly as tall as the content table, and would be keyed to it. I see no advantage. But if DBAs prefer a separate table with a 1:1 relation to the content table, that's fine with me.
Note that the content table is indeed touched a lot less than the revision table.
Also, since content is supposed to be deduplicated (so two revisions with the exact same content will have the same content_address), cannot that replace content_sha1 for revert detection purposes?
Only if we could detect and track "manual" reverts. And the only reliable way to do this right now is by looking at the sha1.
On Thu, Sep 21, 2017 at 6:10 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
Yes, we could put it into a separate table. But that table would be exactly as tall as the content table, and would be keyed to it. I see no advantage.
The advantage is that MediaWiki would almost never need to use the hash table. It would need to add the hash for a new revision there, but table size is not much of an issue on INSERT; other than that, only slow operations like export and API requests which explicitly ask for the hash would need to join on that table. Or is this primarily a disk space concern?
Also, since content is supposed to be deduplicated (so two revisions with the exact same content will have the same content_address), cannot that replace content_sha1 for revert detection purposes?
Only if we could detect and track "manual" reverts. And the only reliable way to do this right now is by looking at the sha1.
The content table points to a blob store which is content-addressable and has its own deduplication mechanism, right? So you just send it the content to store and get an address back, and in the case of a manual revert, that address will be one that has already been used in other content rows. Or do you need to detect the revert before saving it?
SHA1 is not that slow.
For the API/Special:Export definitely not. Maybe for generating the official dump files it might be significant? A single sha1 operation on a modern CPU should not take more than a microsecond: there are a few hundred operations in a decently implemented sha1 and processors are in the GHz range. PHP benchmarks [1] also give similar values. With the 64-byte block size, that's something like 5 hours/TB - not sure how that compares to the dump process itself (also it's probably running on lots of cores in parallel).
[1] http://www.spudsdesign.com/benchmark/index.php?t=hash1
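One way to sanity-check that estimate on real hardware is to measure SHA-1 throughput directly and extrapolate (a measurement sketch; numbers vary by CPU):

```python
# Measurement sketch for the hours-per-terabyte estimate: hash 1 GiB, then
# extrapolate to a terabyte. Results vary a lot by CPU.
import hashlib
import time

buf = b"\0" * (16 * 1024 * 1024)   # 16 MiB buffer
rounds = 64                        # 64 * 16 MiB = 1 GiB hashed in total
start = time.perf_counter()
for _ in range(rounds):
    hashlib.sha1(buf).digest()
elapsed = time.perf_counter() - start

bytes_per_second = rounds * len(buf) / elapsed
hours_per_tb = 1e12 / bytes_per_second / 3600
print(f"~{bytes_per_second / 2**30:.2f} GiB/s, about {hours_per_tb:.2f} hours per TB on one core")
```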
What is the current state, will some kind of digest be retained?
Am 06.12.2017 um 22:09 schrieb John Erling Blad:
What is the current state, will some kind of digest be retained?
The current plan is to keep the SHA1 hash, one for each slot, and an aggregate one for the revision. If there is only one slot, the revision hash is the same as the slot hash.
On 15/09/2017 12:51, Daniel Kinzler wrote:
I'm working on the database schema for Multi-Content-Revisions (MCR) https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema and I'd like to get rid of the rev_sha1 field:
Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more expensive with MCR. With multiple content objects per revision, we need to track the hash for each slot, and then re-calculate the sha1 for each revision.
<snip>
Hello,
That was introduced by Aaron Schulz. The purpose is to have them precomputed, since it is quite expensive to compute them on millions of rows.
A use case was to easily detect reverts.
See for reference: https://phabricator.wikimedia.org/T23860 https://phabricator.wikimedia.org/T27312
I guess Aaron Halfaker, Brion Vibber, Aaron Schulz would have some insights about it.
Antoine Musso wrote:
I guess Aaron Halfaker, Brion Vibber, Aaron Schulz would have some insights about it.
Yes. Brion started a thread about the use of SHA-1 in February 2017:
https://lists.wikimedia.org/pipermail/wikitech-l/2017-February/087664.html https://lists.wikimedia.org/pipermail/wikitech-l/2017-February/087666.html
Of note, we have https://www.mediawiki.org/wiki/Manual:Hashing.
The use of base-36 SHA-1 instead of base-16 SHA-1 for revision.rev_sha1 has always perplexed me. It'd be nice to better(?) document that design decision. It's referenced here: https://lists.wikimedia.org/pipermail/wikitech-l/2012-September/063445.html
MZMcBride
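For reference, a sketch of how the base-36 form relates to the hex digest (interpret the 160-bit digest as an integer, re-encode in base 36, pad to the 31-character column width):

```python
# Sketch of how the base-36 rev_sha1 value relates to the usual hex digest:
# interpret the 160-bit digest as an integer and re-encode it in base 36,
# zero-padded to 31 characters (the rev_sha1 column width).
import hashlib
import string

ALPHABET = string.digits + string.ascii_lowercase  # "0123456789abcdefghijklmnopqrstuvwxyz"

def sha1_base36(content: bytes) -> str:
    value = int.from_bytes(hashlib.sha1(content).digest(), "big")
    digits = []
    while value:
        value, rem = divmod(value, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)).rjust(31, "0")

print(sha1_base36(b"example revision text"))
```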
The revision hashes are also supposed to be used by at least some of the import tools for XML dumps. The dumps would be less valuable without some way to check their content. Generating hashes on the fly is surely not an option given exports can also need to happen within the time of a PHP request (Special:Export for instance).
Nemo
Am 21.09.2017 um 11:24 schrieb Federico Leva (Nemo):
The revision hashes are also supposed to be used by at least some of the import tools for XML dumps. The dumps would be less valuable without some way to check their content.
While this is a typical use case for hashes in theory, I have never heard of any MediaWiki-related tool actually doing this.
Generating hashes on the fly is surely not an option given exports can also need to happen within the time of a PHP request (Special:Export for instance).
Hashing is a lot faster than loading the content. Since Special:Export needs to load the content anyway, the extra cost of hashing is negligible.
If we only need the hashes in contexts where we also need the full content, generating on the fly should work fine.
But if we need revision hashes in a list of 500 revisions returned from the API, *that* we can't calculate on the fly. Similarly, database queries that need the hashes to detect revisions with the same content can't use on-the-fly hashes.