I've updated my dump processing Python project to include code for quickly detecting identity reverts from XML dumps. See https://bitbucket.org/halfak/wikimedia-utilities for the project and the process() function at the bottom of https://bitbucket.org/halfak/wikimedia-utilities/src/f1c8fe7224f3/wmf/dump/p... for the algorithm. The actual function with the revert detection logic is about 50 lines long.
The resulting dump.map function using this revert processor() will emit "revert" revisions and "reverted" revisions with the following fields:
Revert revision:
- "revert" - denotes that this row is a reverting edit
- revision_id - the rev_id of the reverting edit
- reverted_to_id - the rev_id of the reverted-to edit
- for_vandalism - result of applying the D_LOOSE/D_STRICT regular expressions to the reverting comment (see Priedhorsky et al. "Creating, Destroying and Restoring Value in Wikipedia" GROUP 2007)
- reverted_revs - number of revisions that were reverted (this is the number of revisions between the reverting edit and the reverted-to edit)
Reverted revision:
- "reverted" - denotes that this row is a reverted edit
- revision_id - the rev_id of the reverted edit
- reverting_id - the rev_id of the reverting edit
- reverted_to_id - the rev_id of the reverted-to edit
- for_vandalism - result of applying the D_LOOSE/D_STRICT regular expressions to the reverting comment (see Priedhorsky et al. "Creating, Destroying and Restoring Value in Wikipedia" GROUP 2007)
- reverted_revs - number of revisions that were reverted (this is the number of revisions between the reverting edit and the reverted-to edit)
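To make the two row types concrete, here is a minimal sketch of how one detected revert could be turned into rows with the fields listed above. The function name revert_rows, the tuple layout, and the field order are assumptions for illustration only; the actual output format of the project may differ.

```python
def revert_rows(reverting_id, reverted_to_id, reverted_ids, for_vandalism):
    """Yield one "revert" row, then one "reverted" row per reverted revision.

    Hypothetical helper: the tuple layout simply follows the field lists
    above and is not taken from the wikimedia-utilities code.
    """
    reverted_revs = len(reverted_ids)
    # Row describing the reverting edit itself.
    yield ("revert", reverting_id, reverted_to_id, for_vandalism, reverted_revs)
    # One row for every edit that was undone by the revert.
    for rev_id in reverted_ids:
        yield ("reverted", rev_id, reverting_id, reverted_to_id,
               for_vandalism, reverted_revs)


if __name__ == "__main__":
    for row in revert_rows(4, 2, [3], False):
        print(row)
```

Each reverted edit carries the id of the edit that reverted it, so the two row types can be joined on reverting_id and revision_id.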
I hope this is helpful.
-Aaron
On Fri, Aug 19, 2011 at 3:08 PM, Aaron Halfaker aaron.halfaker@gmail.com wrote:
An identity revert is one which changes the article to an absolutely identical previous state. This is a common operation in the English Wikipedia.
There is a paper by Kittur & Kraut (and others), whose title I can't recall, that found the vast majority of reverts of any sort were identity reverts. Some other types they define are:
- "Partial reverts": Part of an edit is discarded
- "Effective reverts": Looks to be an identity revert, but is not *exactly* the same as a previous revision. Often a few white-space characters were out of place.
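As a rough illustration of the white-space case only, one could collapse runs of white space before computing a checksum, so that two revisions differing only in spacing compare equal. This is a sketch under that assumption; normalized_checksum is a hypothetical name, MD5 is an arbitrary choice, and this is not a general detector for effective reverts.

```python
import hashlib


def normalized_checksum(text):
    """Checksum of the text with every run of white space collapsed.

    Hypothetical helper: it only covers the "a few white-space characters
    were out of place" situation described above.
    """
    collapsed = " ".join(text.split())
    return hashlib.md5(collapsed.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Same words, different spacing: the checksums match.
    print(normalized_checksum("foo  bar\n") == normalized_checksum("foo bar"))
```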
See http://www.grouplens.org/node/427 for a discussion of the difficulty of detecting reverts in better ways.
My code detects identity reverts. For example, suppose the following is the content of a sequence of revisions:
1. "foo"
2. "bar"
3. "foobar"
4. "bar"
5. "barbar"
Revision #4 reverts back to revision #2, and revision #3 is reverted. When looking for identity reverts, I have found that limiting the number of revisions that can be reverted to ~15 produces the highest quality of results. This is discussed in http://www.grouplens.org/node/416 (see http://www-users.cs.umn.edu/~halfak/summaries/A_Jury_of_Your_Peers.html for a quick and dirty summary of the work).
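The idea can be sketched in a few lines of Python. This is a minimal illustration, not the code from the project: the function name detect_identity_reverts, the max_reverted parameter, the use of MD5 checksums, and the (rev_id, text) input format are all assumptions.

```python
import hashlib
from collections import deque


def detect_identity_reverts(revisions, max_reverted=15):
    """Yield (reverting_id, reverted_to_id, reverted_ids) for identity reverts.

    `revisions` is an iterable of (rev_id, text) pairs in chronological
    order for a single page. At most `max_reverted` revisions may lie
    between the reverting edit and the reverted-to edit.
    """
    # Keep the reverted-to candidate plus up to max_reverted later revisions.
    recent = deque(maxlen=max_reverted + 1)
    for rev_id, text in revisions:
        checksum = hashlib.md5(text.encode("utf-8")).hexdigest()
        history = list(recent)
        # Search from the most recent revision backwards for identical content.
        for i in range(len(history) - 1, -1, -1):
            if history[i][1] == checksum:
                reverted = [rid for rid, _ in history[i + 1:]]
                if reverted:  # skip edits that did not change the content
                    yield rev_id, history[i][0], reverted
                break
        recent.append((rev_id, checksum))


if __name__ == "__main__":
    revs = [(1, "foo"), (2, "bar"), (3, "foobar"), (4, "bar"), (5, "barbar")]
    print(list(detect_identity_reverts(revs)))  # prints [(4, 2, [3])]
```

Bounding the history with a fixed-length deque is what enforces the limit on how many revisions a single revert may undo; older revisions simply fall out of the window and can no longer be matched.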
This subject deserves a long conversation, but I think the bit you might be interested in is that the identity revert (described above, with an example) seems to be the accepted approach for identifying reverts for most types of analyses.
-Aaron
On Fri, Aug 19, 2011 at 4:39 PM, Flöck, Fabian fabian.floeck@kit.edu wrote:
Hi Aaron,
thanks, that would be awesome :) we built something ourselves, but I'm not quite content with it.
Could you also tell me how you defined a revert (and maybe how you determine who is the reverter)? This is a crucial issue for me. Is it the complete deletion of all the characters entered by an editor in an edit? What about editors who revert others or delete content? Do you treat their edits as being reverted if the deleted content gets reintroduced? Did you take into account the location of the words in the text, or did you use a bag-of-words model? I have read many papers and much tool documentation that use "reverts"; some mention their method (while many don't), but almost no one seems to describe their definition of what a "revert" actually is.
But maybe I will get the answers to this from your code as well :)
Anyway, thanks for the help!
Best, Fabian
On 19 Aug 2011, at 18:31, Aaron Halfaker wrote:
Fabian,
I actually have some software for quickly producing reverts from a database dump. The framework for doing it is available here: https://bitbucket.org/halfak/wikimedia-utilities. I still have to package up the code that actually generates the reverts, though. It's just a matter of finding time to sit down with it and figure out the dependencies! I expect that I can have it ready by Monday. I hope to actually package up the revert-detecting code into the above Python project as an example.
I just wanted to let you know that I have a response for you on the way.
-Aaron
On Thu, Aug 18, 2011 at 4:40 AM, Flöck, Fabian fabian.floeck@kit.edu wrote:
Hi,
I'm trying to detect reverts in Wikipedia for my research, right now with a self-built script using MD5 hashes and diffs between revisions. I always read about people taking reverts into account in their data, but it's seldom described HOW exactly a revert is determined or what tool they use to do that. Can you point me to any research or tools, or maybe tell me what you used in your own research to identify which edits were reverted and/or who reverted them?
Best,
Fabian
-- Karlsruhe Institute of Technology (KIT) Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck Research Associate
Building 11.40, Room 222 KIT-Campus South D-76128 Karlsruhe
Phone: +49 721 608 4 6584 Skype: f.floeck_work E-Mail: fabian.floeck@kit.edu WWW: http://www.aifb.kit.edu/web/Fabian_Fl%C3%B6ck
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Hi Aaron,
Neat LimitedQueue class. It looks like this revert code wouldn't handle some corner cases. For example, I don't see logic that would distinguish between blanking (which produces duplicate checksums) and reverts.
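One possible way to separate the two cases, sketched here purely as an illustration: when a checksum matches an earlier revision, label the match "blanking" if the new content is empty (or only white space) and "revert" otherwise. The function name label_checksum_matches, the labels, the MD5 checksum, and the (rev_id, text) input format are assumptions, not part of the project.

```python
import hashlib
from collections import deque


def label_checksum_matches(revisions, max_reverted=15):
    """Yield (rev_id, matched_id, kind) for revisions whose checksum repeats.

    `kind` is "blanking" when the repeated content is empty and "revert"
    otherwise. Hypothetical sketch of one way to handle the corner case.
    """
    recent = deque(maxlen=max_reverted + 1)
    for rev_id, text in revisions:
        checksum = hashlib.md5(text.encode("utf-8")).hexdigest()
        # Most recent earlier revision with identical content, if any.
        matched = next(
            (rid for rid, seen in reversed(recent) if seen == checksum), None
        )
        if matched is not None:
            kind = "blanking" if not text.strip() else "revert"
            yield rev_id, matched, kind
        recent.append((rev_id, checksum))


if __name__ == "__main__":
    revs = [(1, "foo"), (2, ""), (3, "foo"), (4, "")]
    print(list(label_checksum_matches(revs)))
```

Whether a repeated blanking should count as a revert at all is a judgment call for the analysis; the sketch only makes the distinction visible so it can be filtered either way.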
-- Best, Dmitry
On Sun, Aug 21, 2011 at 3:15 PM, Aaron Halfaker aaron.halfaker@gmail.com wrote: