So, as things stand, rev_sha1 in the database is used for:
1. the XML dumps process and all the researchers depending on the XML dumps (probably just for revert detection)
2. revert detection for libraries like python-mwreverts [1]
3. revert detection in mediawiki history reconstruction processes in Hadoop (Wikistats 2.0)
4. revert detection in Wikistats 1.0
5. revert detection for tools that run on labs, like Wikimetrics
?. I think Aaron also uses rev_sha1 in ORES, but I can't seem to find the latest code for that service
If you think about this list above as a flow of data, you'll see that rev_sha1 is replicated to xml, labs databases, hadoop, ML models, etc. So removing it and adding it back downstream from the main mediawiki database somewhere, like in XML, cuts off the other places that need it. That means it must be available either in the mediawiki database or in some other central database which all those other consumers can pull from.
I defer to your expertise when you say it's expensive to keep in the db, and I can see how that would get much worse with MCR. I'm sure we can figure something out, though. Right now it seems like our options are, as others have pointed out:
* compute async and store in DB or somewhere else that's central and easy to access from all the branches I mentioned
* update how we detect reverts and keep a revert database with good references to wiki_db, rev_id so it can be brought back in context.
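
To make the first option a bit more concrete, here is a minimal sketch of recomputing the checksum outside the main database. It assumes rev_sha1 is the SHA-1 of the revision text written in base 36 and left-padded to 31 characters; the function name base36_sha1 is made up for this example, and the real asynchronous job and its storage are not shown.

```python
import hashlib

# Digits used for base-36 output (0-9, then a-z).
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def base36_sha1(text: str) -> str:
    """Return the SHA-1 of `text` in base 36, zero-padded to 31 characters.

    Hypothetical helper: assumes the stored checksum uses this encoding.
    """
    number = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    # 36**31 > 2**160, so 31 characters always fit a SHA-1 value.
    return "".join(reversed(digits)).rjust(31, "0")


if __name__ == "__main__":
    print(base36_sha1("Some revision text"))
```

A job like this could run off the replication stream or the dumps and write (wiki_db, rev_id, sha1) rows to whatever central store the consumers agree on.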
Personally, I would love to get better revert detection; sha1 exact matches only catch edits that restore a page to a byte-identical earlier state, which doesn't really get to the heart of the issue. Important phenomena like revert wars, bullying, and stalking are hiding behind bad revert detection. I'm happy to brainstorm ways we can use Analytics infrastructure to do this. We definitely have the tools necessary, but not so much the manpower. That said, please don't strip out rev_sha1 until we've accounted for all its "data customers".
So, put another way, I think it's totally fine if we say ok everyone, from date XYZ, you will no longer have rev_sha1 in the database, but if you want to know whether an edit reverts a previous edit or a series of edits, go *HERE*. That's fine. And just for context, here's how we do our revert detection in Hadoop (it's pretty fancy) [2].
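
For anyone who hasn't looked at how checksum-based revert detection works, here is a rough sketch of the general idea. It is not the python-mwreverts API and not the Hadoop job in [2]; the names Revert and detect_identity_reverts and the default radius of 15 are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Revert(NamedTuple):
    reverting_id: int        # revision that restored an earlier state
    reverted_ids: List[int]  # revisions that were undone
    reverted_to_id: int      # earlier revision with the same checksum


def detect_identity_reverts(
    revisions: Iterable[Tuple[int, str]], radius: int = 15
) -> Iterator[Revert]:
    """Yield reverts from (rev_id, sha1) pairs given in chronological order.

    A revision counts as a revert when its checksum equals that of one of
    the previous `radius` revisions and at least one revision sits between.
    """
    window: List[Tuple[int, str]] = []  # most recent revision last
    for rev_id, sha1 in revisions:
        # Search backwards for the most recent revision with the same checksum.
        for i in range(len(window) - 1, -1, -1):
            if window[i][1] == sha1:
                reverted = [rid for rid, _ in window[i + 1:]]
                if reverted:  # skip no-op edits that undo nothing
                    yield Revert(rev_id, reverted, window[i][0])
                break
        window.append((rev_id, sha1))
        if len(window) > radius:
            window.pop(0)


if __name__ == "__main__":
    history = [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "c")]
    for revert in detect_identity_reverts(history):
        print(revert)
```

Note that everything here hinges on having a per-revision checksum available, which is why every consumer in the list above needs rev_sha1 or a replacement for it.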
[1] https://github.com/mediawiki-utilities/python-mwreverts
[2] https://github.com/wikimedia/analytics-refinery-source/blob/1d38b8e4acfd10dc...
On Mon, Sep 18, 2017 at 9:19 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
On 16.09.2017 at 01:22, Matthew Flaschen wrote:
On 09/15/2017 06:51 AM, Daniel Kinzler wrote:
Also, I believe Roan is currently looking for a better mechanism for tracking all kinds of reverts directly.
Let's see if we want to use rev_sha1 for that better solution (a way to track reverts within MW itself) before we drop it.
The problem is that if we don't drop it, we have to *introduce* it for the new content table for MCR. I'd like to avoid that.
I guess we can define the field and just null it, but... well. I'd like to avoid that.
--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.