Dear Wiki Researchers,
Over the summer we worked on Wikihadoop [0], a tool that uses Hadoop to create the diffs between two revisions of a wiki article.
The resulting diff dataset is based on the English Wikipedia April 2011 XML dump files. The advantages of this dataset are that you can:
a) Search for specific content being added or removed
b) Measure more accurately how much text an editor has added or removed
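
To illustrate what a per-revision diff makes possible, here is a minimal sketch in Python. It is not how Wikihadoop itself computes diffs; it uses the standard library's difflib as a stand-in, works on whitespace-separated words, and the function names (diff_revisions, chars_changed) and the sample revision texts are made up for the example.

```python
import difflib


def diff_revisions(old_text, new_text):
    """Return (added, removed): lists of word runs that differ between two revisions.

    Hypothetical helper for illustration; word-level granularity is an assumption.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" counts as both a removal (old side) and an addition (new side).
        if tag in ("delete", "replace"):
            removed.append(" ".join(old_words[i1:i2]))
        if tag in ("insert", "replace"):
            added.append(" ".join(new_words[j1:j2]))
    return added, removed


def chars_changed(added, removed):
    """Return (characters added, characters removed) for one diff."""
    return sum(len(s) for s in added), sum(len(s) for s in removed)


# Made-up sample revisions, not taken from the dataset.
old = "Hadoop is a framework for distributed storage."
new = "Hadoop is an open-source framework for distributed processing."

added, removed = diff_revisions(old, new)
print(added)                          # ['an open-source', 'processing.']
print(removed)                        # ['a', 'storage.']
print(chars_changed(added, removed))  # (25, 9)

# (a) searching for specific content that was added in this revision
print(any("open-source" in run for run in added))  # True
```

Comparing revision sizes alone would report a net change of 16 characters here, while the diff shows 25 characters added and 9 removed, which is the kind of difference point b) refers to.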
We are currently working on a Lucene-based application [1] that will allow us to quickly search for specific strings being added or removed.
If you have any questions, please let me know!
Best regards,
Diederik van Liere