Hello,
I was exploring the dataset shared in the Wikipedia Detox
<https://meta.wikimedia.org/wiki/Research:Modeling_Talk_Page_Abuse>
project. I was trying to use similar diff logic to obtain the changes
made to a page, given its *revid*, but realized that the Wikipedia API provides
only the diff of a revision against its previous version. I am able to fetch the
diffs for a set of *revids* using the Wikipedia API, but I am unable to
extract only the changed sentences in the revision. I found this
<https://github.com/ewulczyn/wiki-detox/blob/master/src/data_generation/diff_utils.py>
script in the project source files that seems to contain parts of what may
have been used in the actual data collection process to obtain the changes from
the Talk pages, but I am unable to figure out high-level details such as
the input and output formats.
Can anyone provide a solution or suggest how to proceed?
Also, it would be really beneficial if I could use the same diff logic as
the original authors, to ensure consistency.
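To make the goal concrete, here is a minimal sketch of the kind of output I am
after. It assumes the full wikitext of a revision and of its parent are already
in hand as strings, and it uses Python's standard difflib rather than the
project's own diff logic, so it is only an approximation. The function name
added_lines is my own placeholder and does not come from the wiki-detox code.

```python
import difflib


def added_lines(parent_text, revision_text):
    """Return the lines of revision_text that were inserted or changed
    relative to parent_text (a line-level approximation, not sentence-level)."""
    parent_lines = parent_text.splitlines()
    revision_lines = revision_text.splitlines()

    # autojunk=False disables difflib's heuristic that treats frequent
    # lines as junk, which can distort diffs on long pages.
    matcher = difflib.SequenceMatcher(
        a=parent_lines, b=revision_lines, autojunk=False
    )

    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both contribute new text on the revision side.
        if tag in ("insert", "replace"):
            added.extend(revision_lines[j1:j2])
    return added
```

Ideally I would like the equivalent of this, but matching whatever the
original pipeline did.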
Meanwhile, I have asked a similar question on Stack Overflow
<https://stackoverflow.com/questions/46010675/extract-changes-from-wikipedia-wikimedia-revision-pages>
and emailed the original Wikimedia author of the paper.
Regards,
Pinkesh Badjatiya
pinkeshbadjatiya(a)gmail.com
IIIT Hyderabad