Sending this again from my current address. Left Gmail a long time ago -- not sure the redirect still works... My apologies if this is hitting your inbox twice!
_ _ _ _
Dear Wikimedia research community,
I have a question for the data-savvy people on this list :)
My goal is simple: for a sample of English Wikipedia editors, I'm trying to identify their edits which were reverted. I can see two possible ways of doing this:
1. Identify the reverts using the SHA1 values. (A revert happens when the edit exactly restores the page to its previous state.)
2. Identify the reverts using the "undo" button.
As I see it, solution 2 is less "precise" (you'll miss some reverts, e.g., those performed manually). However, it would also be less computationally intensive, and I don't see that it would introduce any bias (results can be compared across editors in a statistical model).
However, I do not see the information about whether a revision was reverted using the “undo” button in the enwiki database: https://www.mediawiki.org/w/index.php?title=Manual:Database_layout/diagram&a...
I find this surprising. Am I missing something? (And if so, how do you personally feel about strategy 1 vs. strategy 2?)
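A minimal sketch of strategy 1, assuming each page's revisions are already available in chronological order as (rev_id, sha1) pairs. The function names and the `radius` cutoff are illustrative choices, not part of MediaWiki or of any library:

```python
from collections import deque


def find_identity_reverts(revisions, radius=15):
    """Detect identity reverts in one page's history.

    `revisions` is an iterable of (rev_id, sha1) pairs, oldest first.
    Returns a list of (reverting_id, reverted_to_id, [reverted_ids]).
    `radius` (an assumption of this sketch) caps how many earlier
    revisions are searched for an identical checksum.
    """
    window = deque(maxlen=radius)  # the `radius` most recent revisions
    reverts = []
    for rev_id, sha1 in revisions:
        recent = list(window)
        # Walk backwards to find the latest earlier revision with the same content.
        for i in range(len(recent) - 1, -1, -1):
            if recent[i][1] == sha1:
                reverted = [r for r, _ in recent[i + 1:]]
                # An empty list means the edit only repeated the previous
                # state (a null edit), so nothing was actually reverted.
                if reverted:
                    reverts.append((rev_id, recent[i][0], reverted))
                break
        window.append((rev_id, sha1))
    return reverts


def reverted_rev_ids(revisions, radius=15):
    """Return the set of revision ids that were undone by an identity revert."""
    return {
        rid
        for _, _, reverted in find_identity_reverts(revisions, radius)
        for rid in reverted
    }
```

Intersecting the returned set with the revision ids authored by each sampled editor would then give that editor's reverted edits.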
Thank you so much for any insight you might be willing to provide! :D
Sincerely,
Jérôme
Hi Jérôme,
Have you looked at python-mwreverts (https://github.com/mediawiki-utilities/python-mwreverts)? This library has been used by many researchers who are studying reverted edits, and it may be useful for your work as well.
All the best,
--Nathan
Wiki-research-l mailing list -- wiki-research-l@lists.wikimedia.org To unsubscribe send an email to wiki-research-l-leave@lists.wikimedia.org
Hi Jérôme, I wrote a little overview of this a while back that might be of use: https://meta.wikimedia.org/wiki/User:Isaac_(WMF)/Analysis_gotchas#Reverts_(P...)
Essentially, the library that Nathan suggested (mwreverts) is great for the shasum-based approach, and you'll need to use edit tags (https://en.wikipedia.org/wiki/Wikipedia:Tags) to check for additional tool-based reverts like mw-undo, mw-rollback, etc. I think combining the two approaches makes the most sense, and you can see a bunch more details on their overlap for English Wikipedia in this task: https://phabricator.wikimedia.org/T266374
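As a rough illustration of combining the two approaches, the sketch below compares tag-based and shasum-based results for a set of revisions. It assumes each revision is a dict with a `revid` and a list of `tags` (a shape chosen to resemble what the MediaWiki API returns for revisions), and that the shasum-based reverting revision ids were computed separately. The function name and the output keys are hypothetical:

```python
# Tags named in this thread; a given wiki may define other revert-related tags.
REVERT_TAGS = {"mw-undo", "mw-rollback"}


def compare_revert_signals(revisions, sha1_reverting_ids):
    """Split reverting edits by which signal caught them.

    `revisions`: iterable of dicts with 'revid' (int) and 'tags' (list of str).
    `sha1_reverting_ids`: revision ids flagged as reverting by a checksum method.
    Note that the tags sit on the reverting edit, not on the edit it undoes.
    """
    tagged = {
        rev["revid"]
        for rev in revisions
        if REVERT_TAGS & set(rev.get("tags", []))
    }
    by_sha1 = set(sha1_reverting_ids)
    return {
        "both": tagged & by_sha1,        # caught by tags and by checksums
        "tag_only": tagged - by_sha1,    # tool-based, page state not identical
        "sha1_only": by_sha1 - tagged,   # e.g. manual restores without a tool
    }
```

The sizes of the three sets give a quick sense of how much each approach would miss on its own for a particular sample.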
It sounds like you're collecting specific edits so this is probably less relevant, but I'll also highlight the excellent public dataset put together by the Wikimedia Foundation Data Engineering team that has the full edit history for each language edition and includes metadata such as whether the edit was a revert based on shasums as well as the edit tags. If you were processing many, many edits, I'd suggest starting with this as it would have all the information you need in one place.
- More details: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_hist...
- You can see an example of how to access and process these dumps on Wikimedia-hosted Jupyter notebooks (PAWS, https://wikitech.wikimedia.org/wiki/PAWS) here: https://public-paws.wmcloud.org/User:Isaac_(WMF)/Denormalized%20Edit%20Histo...
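For a sample of editors, tallying reverted edits from that dataset might look roughly like the sketch below. It assumes each dump row has already been parsed into a dict keyed by field name, and that the fields `event_entity`, `event_type`, `event_user_text` and `revision_is_identity_reverted` are present with booleans stored as the string "true". These names and encodings are my recollection of the schema and should be checked against the documentation linked above; the function name is hypothetical:

```python
from collections import Counter


def count_reverted_by_user(rows, users):
    """Return {user: (reverted_edits, total_edits)} for the sampled users.

    `rows`: iterable of dicts, one per history event (assumed field names).
    `users`: collection of user names to include in the tally.
    """
    users = set(users)
    total, reverted = Counter(), Counter()
    for row in rows:
        # Keep only events that represent a new revision being saved.
        if row["event_entity"] != "revision" or row["event_type"] != "create":
            continue
        user = row["event_user_text"]
        if user not in users:
            continue
        total[user] += 1
        # Assumption: the flag is serialized as the string "true" or "false".
        if row["revision_is_identity_reverted"] == "true":
            reverted[user] += 1
    return {user: (reverted[user], total[user]) for user in users}
```

Because the function takes any iterable of dicts, the rows can be streamed from the dump files one at a time rather than loaded into memory.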
Hope that helps!
Best, Isaac
Hear, hear! There are no problems ever posted on this list. Only solutions! :D
Thank you for those insights! Looks like I have a plan :)
Jérôme
From: Isaac Johnson isaac@wikimedia.org
To: wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] Re: Collecting reverted edits at user level
Date: 14/04/2023 17:47:56 Europe/Paris