Dear All,
I was at OpenSym yesterday (many thanks to Dirk for organizing this!), and
I was chatting with some people about attribution of content to its authors
in a wiki.
So I got inspired, and I cleaned up some code that Michael Shavlovsky and I
had written for this:
https://github.com/lucadealfaro/authorship-tracking
The way to use it is super simple (see below). The attribution object can
also be serialized and de-serialized to/from JSON (see the documentation on
GitHub).
The idea behind the code is to attribute the content to the *earliest
revision* where the content was inserted, not the latest as diff tools
usually do. So if some piece of text is inserted, then deleted, then
re-inserted (in a revert or a normal edit), we still attribute it to the
earliest revision. This is somewhat similar to what we tried to do in
WikiTrust, but it's better done, and far more efficient.
The algorithm details can be found in
http://www2013.wwwconference.org/proceedings/p343.pdf
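To make the earliest-revision idea concrete, here is a toy sketch in plain Python. It is not the algorithm from the paper or the code in the repository: the function name attribute_earliest and the min_match parameter are made up for illustration, the matching is a naive greedy longest-match against every earlier revision, and it makes no attempt at efficiency. It only shows how labels can be inherited from older revisions, so that text which is deleted and later re-inserted keeps its original attribution.

```python
def attribute_earliest(revisions, min_match=3):
    """Toy earliest-revision attribution (illustrative only).

    revisions: list of (label, tokens) pairs, oldest first.
    min_match: hypothetical threshold; shorter matches are treated as new text.
    Returns one label per token of the last revision.
    """
    history = []  # (tokens, labels) of every earlier revision, oldest first
    labels = []
    for label, tokens in revisions:
        # By default, every token is attributed to the current revision.
        labels = [label] * len(tokens)
        i = 0
        while i < len(tokens):
            best_len, best_labels = 0, None
            for old_tokens, old_labels in history:
                for j in range(len(old_tokens)):
                    # Length of the common run starting at tokens[i], old_tokens[j].
                    k = 0
                    while (i + k < len(tokens) and j + k < len(old_tokens)
                           and tokens[i + k] == old_tokens[j + k]):
                        k += 1
                    # Strict comparison: on ties, the older revision wins.
                    if k > best_len:
                        best_len, best_labels = k, old_labels[j:j + k]
            if best_len >= min_match:
                # Inherit the labels the matched text already had.
                labels[i:i + best_len] = best_labels
                i += best_len
            else:
                i += 1
        history.append((tokens, labels))
    return labels


# Text inserted in revA, removed in revB, and restored in revC stays with revA.
revs = [
    ("revA", "the quick brown fox".split()),
    ("revB", "something else".split()),
    ("revC", "the quick brown fox jumps".split()),
]
print(attribute_earliest(revs))
# ['revA', 'revA', 'revA', 'revA', 'revC']
```

Because each revision stores the labels it inherited, a match against any later copy of the text still leads back to the revision that first introduced it.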
I hope this might be of interest!
Luca
import authorship_attribution
a = authorship_attribution.AuthorshipAttribution.new_attribution_processor(N=4)
a.add_revision("I like to eat pasta".split(), revision_info="rev0")
a.add_revision("I like to eat pasta with tomato sauce".split(),
               revision_info="rev1")
a.add_revision("I like to eat rice with tomato sauce".split(),
               revision_info="rev3")
print(a.get_attribution())
['rev0', 'rev0', 'rev0', 'rev0', 'rev3', 'rev1', 'rev1', 'rev1']