So, as things stand, rev_sha1 in the database is used for:
1. the XML dumps process and all the researchers depending on the XML dumps
(probably just for revert detection)
2. revert detection for libraries like python-mwreverts [1]
3. revert detection in MediaWiki history reconstruction processes in Hadoop
(Wikistats 2.0)
4. revert detection in Wikistats 1.0
5. revert detection for tools that run on labs, like Wikimetrics
?. I think Aaron also uses rev_sha1 in ORES, but I can't seem to find the
latest code for that service
If you think of the list above as a flow of data, you'll see that
rev_sha1 is replicated to XML, the labs databases, Hadoop, ML models, etc. So
removing it from the main MediaWiki database and adding it back somewhere
downstream, like in the XML dumps, cuts off the other places that need it.
That means it must be available either in the MediaWiki database or in some
other central database which all those consumers can pull from.
I defer to your expertise when you say it's expensive to keep in the db,
and I can see how that would get much worse with MCR. I'm sure we can
figure something out, though. Right now it seems like our options are, as
others have pointed out:
* compute async and store in DB or somewhere else that's central and easy
to access from all the branches I mentioned
* update how we detect reverts and keep a revert database with good
references to wiki_db, rev_id so it can be brought back in context.
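
To make the first option a bit more concrete: as far as I know, rev_sha1 is just the SHA-1 of the revision text, written in base 36 and left-padded to 31 characters, so any job with access to the content could recompute it outside the main database. Here is a minimal Python sketch under that assumption (worth verifying against MediaWiki itself). The function names are mine, not from any existing tool:

```python
import hashlib

_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int, pad: int = 31) -> str:
    """Encode a non-negative integer in base 36, left-padded with zeros."""
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = _DIGITS[r] + out
    return out.rjust(pad, "0")


def rev_sha1(text: str) -> str:
    """Assumed rev_sha1 format: SHA-1 of the UTF-8 text, base 36, 31 chars."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return to_base36(int(digest, 16))
```

An async job could run something like this per revision and write (wiki_db, rev_id, sha1) rows to whatever central store we agree on.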
Personally, I would love to get better revert detection; using sha1 exact
matches doesn't really get to the heart of the issue. Important phenomena
like revert wars, bullying, and stalking are hiding behind bad revert
detection. I'm happy to brainstorm ways we can use Analytics
infrastructure to do this. We definitely have the tools necessary, but not
so much the manpower. That said, please don't strip out rev_sha1 until
we've accounted for all its "data customers".
So, put another way, I think it's totally fine if we say: OK everyone, from
date XYZ, you will no longer have rev_sha1 in the database, but if you want
to know whether an edit reverts a previous edit or a series of edits, go
*HERE*. And just for context, here's how we do our revert detection in
Hadoop (it's pretty fancy) [2].
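
For anyone who hasn't looked at what sha1-based revert detection amounts to, here is a minimal sketch of the basic idea (often called an identity revert): a revision counts as a revert if its sha1 matches that of a recent earlier revision of the same page, and everything in between is considered reverted. This is an illustration only, not the code in [1] or [2]. The function name, the Revert tuple, and the default radius of 15 are my own assumptions:

```python
from collections import namedtuple

# Hypothetical result record: the reverting revision, the revisions it undid,
# and the earlier revision whose content it restored.
Revert = namedtuple("Revert", ["reverting", "reverteds", "reverted_to"])


def detect_identity_reverts(revisions, radius=15):
    """Yield Reverts from (rev_id, sha1) pairs of one page, oldest first.

    Only the last `radius` revisions are searched for a matching sha1.
    """
    window = []  # the most recent (rev_id, sha1) pairs, oldest first
    for rev_id, sha1 in revisions:
        # Search backwards so the nearest matching revision wins.
        for i in range(len(window) - 1, -1, -1):
            if window[i][1] == sha1:
                reverteds = [r for r, _ in window[i + 1:]]
                # An identical immediate predecessor undoes nothing; skip it.
                if reverteds:
                    yield Revert(rev_id, reverteds, window[i][0])
                break
        window.append((rev_id, sha1))
        if len(window) > radius:
            window.pop(0)
```

It also shows the limitation I mentioned: a partial revert, or a revert with one extra character changed, produces a different sha1 and is missed entirely.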
[1] https://github.com/mediawiki-utilities/python-mwreverts
[2] https://github.com/wikimedia/analytics-refinery-source/blob/1d38b8e4acfd10d…
On Mon, Sep 18, 2017 at 9:19 AM, Daniel Kinzler <daniel.kinzler(a)wikimedia.de> wrote:
On 16.09.2017 at 01:22, Matthew Flaschen wrote:
> On 09/15/2017 06:51 AM, Daniel Kinzler wrote:
> > Also, I believe Roan is currently looking for a better mechanism for
> > tracking all kinds of reverts directly.
> Let's see if we want to use rev_sha1 for that better solution (a way to
> track reverts within MW itself) before we drop it.
The problem is that if we don't drop it, we have to *introduce* it for the
new content table for MCR. I'd like to avoid that.
I guess we can define the field and just null it, but... well. I'd like to
avoid that.
--
Daniel Kinzler
Principal Platform Engineer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.