Dear All,
I was at OpenSym yesterday (many thanks to Dirk for organizing this!), and I was chatting with some people about attributing the content of a wiki to its authors. That inspired me to clean up some code that Michael Shavlovsky and I had written for this:
https://github.com/lucadealfaro/authorship-tracking
The way to use it is super simple (see below). The attribution object can also be serialized to and de-serialized from JSON (see the documentation on GitHub).
The idea behind the code is to attribute the content to the *earliest revision* where the content was inserted, not the latest, as diff tools usually do. So if some piece of text is inserted, then deleted, then re-inserted (in a revert or a normal edit), we still attribute it to the earliest revision. This is somewhat similar to what we tried to do in WikiTrust, but it's better done, and far more efficient.
The algorithm details can be found in http://www2013.wwwconference.org/proceedings/p343.pdf
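To make the "earliest revision" idea concrete, here is a toy sketch. It is not the algorithm from the paper and not the library's API: the class name ToyAttribution, its methods, and the parameter n are made up for illustration, and it assumes that a run of n consecutive tokens seen in any earlier revision counts as the same content. It only shows the inserted, deleted, then re-inserted case described above.

```python
class ToyAttribution(object):
    """Hypothetical toy tracker; NOT the authorship-tracking library's API."""

    def __init__(self, n=3):
        self.n = n            # assumed minimum run of tokens that counts as a match
        self.labels = []      # revision labels, in the order they were added
        self.first_seen = {}  # n-gram -> origin revision index of each of its tokens
        self.current = []     # origin revision index of each token in the latest revision

    def add_revision(self, tokens, label):
        rev = len(self.labels)
        self.labels.append(label)
        # Start by assuming every token is new in this revision.
        origin = [rev] * len(tokens)
        windows = [tuple(tokens[i:i + self.n])
                   for i in range(len(tokens) - self.n + 1)]
        # Any n-gram seen in ANY earlier revision keeps its earliest origin,
        # even if it was absent from the revision just before this one.
        for i, gram in enumerate(windows):
            known = self.first_seen.get(gram)
            if known is not None:
                for k, old in enumerate(known):
                    origin[i + k] = min(origin[i + k], old)
        # Remember n-grams that appear here for the first time.
        for i, gram in enumerate(windows):
            self.first_seen.setdefault(gram, tuple(origin[i:i + self.n]))
        self.current = origin

    def get_attribution(self):
        return [self.labels[r] for r in self.current]


tracker = ToyAttribution(n=3)
tracker.add_revision("the cat sat on the mat".split(), "rev0")
tracker.add_revision("the cat sat".split(), "rev1")                   # "on the mat" deleted
tracker.add_revision("the cat sat on the mat today".split(), "rev2")  # re-inserted, plus "today"
result = tracker.get_attribution()
print(result)
# ['rev0', 'rev0', 'rev0', 'rev0', 'rev0', 'rev0', 'rev2']
```

A diff against the previous revision alone would credit "on the mat today" to rev2; here only "today" is new, because the re-inserted words are traced back to rev0. The sketch keeps every n-gram it has ever seen, so it says nothing about the efficiency of the real code.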
I hope this might be of interest!
Luca
import authorship_attribution

a = authorship_attribution.AuthorshipAttribution.new_attribution_processor(N=4)
a.add_revision("I like to eat pasta".split(), revision_info="rev0")
a.add_revision("I like to eat pasta with tomato sauce".split(), revision_info="rev1")
a.add_revision("I like to eat rice with tomato sauce".split(), revision_info="rev3")
print(a.get_attribution())
['rev0', 'rev0', 'rev0', 'rev0', 'rev3', 'rev1', 'rev1', 'rev1']