Hello,
I am exploring the dataset shared by the Wikipedia Detox project https://meta.wikimedia.org/wiki/Research:Modeling_Talk_Page_Abuse. I am trying to use similar diff logic to obtain the changes made to a page from a *revid*, but the Wikipedia API only provides the diff between a revision and its previous version. I can fetch the diffs for a set of *revids* using the Wikipedia API, but I cannot extract only the sentences that changed in each revision. I found this script in the project source, https://github.com/ewulczyn/wiki-detox/blob/master/src/data_generation/diff_utils.py, which appears to contain parts of what was used in the actual data collection to obtain the changes from the Talk pages, but I cannot work out high-level details such as its input and output formats.
Can anyone provide a solution or suggest how to proceed? Also, it would be really helpful if I could use the same diff logic as the original authors to ensure consistency.
Meanwhile, I have asked a similar question on StackOverflow https://stackoverflow.com/questions/46010675/extract-changes-from-wikipedia-wikimedia-revision-pages and emailed the original Wikimedia author of the paper.
Regards, Pinkesh Badjatiya pinkeshbadjatiya@gmail.com IIIT Hyderabad
I believe that Ellery's work used my mwdiffs library, which is largely based on deltas.
http://pythonhosted.org/mwdiffs/ http://pythonhosted.org/deltas/
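To illustrate the general idea of pulling only the added text out of a revision, here is a minimal sketch. It uses Python's standard difflib rather than mwdiffs or deltas, so it is a swapped-in technique and its output is not guaranteed to match what the Wiki Detox pipeline produced. The names TOKEN_RE, tokenize and inserted_text are hypothetical, and the sketch assumes you have already fetched the wikitext of both the revision and its parent.

```python
import difflib
import re

# Hypothetical tokenizer: words, whitespace runs, and single punctuation marks.
# This is an assumption for illustration, not the tokenizer used by deltas.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text):
    """Split text into tokens that can be joined back into the same text."""
    return TOKEN_RE.findall(text)


def inserted_text(parent_text, revision_text):
    """Return the spans of text present in the revision but not in its parent.

    Token-level diff with difflib.SequenceMatcher (a stand-in for deltas).
    """
    a = tokenize(parent_text)
    b = tokenize(revision_text)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    added = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        # "insert" and "replace" both contribute new tokens on the b side.
        if tag in ("insert", "replace"):
            added.append("".join(b[b1:b2]))
    return added


if __name__ == "__main__":
    parent = "== Topic ==\nFirst comment.\n"
    revision = parent + "Second comment added later.\n"
    for span in inserted_text(parent, revision):
        print(repr(span))
```

Unlike this sketch, a plain sequence diff like difflib does not track text that was moved within the page, so for archived or reordered Talk page threads the results may differ from a diff that detects moved blocks.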
On Sun, Sep 3, 2017 at 2:54 PM, Pinkesh Badjatiya <pinkeshbadjatiya@gmail.com> wrote:
wiki-research-l@lists.wikimedia.org