Hey Luca!

Welcome back to the content persistence tracking club! 

I feel like I should clear up some misconceptions. First, yours is not the first Python library for determining the authorship of content in versioned text, and I don't think you have given fair treatment to the work we have been doing since you last worked on WikiTrust. For example, it's hard to tell from your description whether you are doing anything different from the wikiwho API [2] when tracking content historically. Further, the work I have been doing with diff-based content persistence (e.g. [1]) is not so simple that it fails to notice removals and re-additions; it catches them under most circumstances.

In my opinion, the diff-based approach is much better for measuring the productivity of a contribution (adding content that looks like content that was removed long ago is still productive, isn't it?), but maybe less useful for attributing first-contributor status to a particular sub-statement. Regardless, it seems that a qualitative analysis is necessary to determine whether these differences matter and whether one strategy is better than the other. AFAICT, the only software that has received this kind of analysis is wikiwho (discussed in [3]).
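
To make the distinction concrete, here is a minimal sketch of diff-based crediting. It is only an illustration built on Python's standard difflib, not the code behind [1] or [2]; the function name diff_persistence and the "token-revisions survived" count are assumptions of mine. Note that this naive version does not detect reverts at all, so every re-added token is credited to whoever re-added it.

```python
from difflib import SequenceMatcher


def diff_persistence(revisions):
    """Credit each token to the revision whose diff inserted it.

    `revisions` is a list of (rev_id, tokens) pairs, oldest first.
    Returns (attribution, persisted): `attribution` holds the rev_id
    credited for each token of the last revision, and `persisted` maps
    each rev_id to the number of token-revisions its additions survived.
    """
    attribution = []  # rev_id credited for each token of the previous revision
    prev_tokens = []
    persisted = {}
    for rev_id, tokens in revisions:
        persisted.setdefault(rev_id, 0)
        new_attribution = []
        matcher = SequenceMatcher(a=prev_tokens, b=tokens, autojunk=False)
        for op, a1, a2, b1, b2 in matcher.get_opcodes():
            if op == "equal":
                # Carried-over tokens keep their credit and survive one more revision.
                for owner in attribution[a1:a2]:
                    new_attribution.append(owner)
                    persisted[owner] += 1
            elif op in ("insert", "replace"):
                # Tokens the diff sees as new are credited to this revision,
                # even if identical text was removed earlier.
                new_attribution.extend([rev_id] * (b2 - b1))
            # "delete": removed tokens simply stop accumulating persistence.
        attribution, prev_tokens = new_attribution, tokens
    return attribution, persisted


revs = [
    ("rev0", "I like pasta".split()),
    ("rev1", "I like rice".split()),
    ("rev2", "I like pasta".split()),
]
attribution, persisted = diff_persistence(revs)
print(attribution)        # ['rev0', 'rev0', 'rev2']
print(persisted["rev0"])  # 4
```

Here "pasta" ends up credited to rev2, the revision that re-added it. An earliest-revision strategy would credit rev0 instead, and that is exactly the kind of difference a qualitative analysis would have to weigh.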

Regardless, it's great to have you working in this space again, and I welcome you to help us develop an overview of content persistence measurement strategies that is complete and allows others to critically decide which strategy matches their needs. See https://meta.wikimedia.org/wiki/Research:Content_persistence for such an overview. I encourage you to use this description of persistence measures to differentiate your strategy from the work we have been doing over the last 5 years. Edit boldly!

1. https://pythonhosted.org/mediawiki-utilities/lib/persistence.html#mw-lib-persistence
2. http://people.aifb.kit.edu/ffl/wikiwho/
3. http://people.aifb.kit.edu/ffl/wikiwho/fp715-floeck.pdf

-Aaron


On Aug 21, 2015 4:52 PM, "Luca de Alfaro" <luca@dealfaro.com> wrote:
Dear All, 

I was at OpenSym yesterday (many thanks to Dirk for organizing this!), and I was chatting with some people about attribution of content to its authors in a wiki.
So I got inspired, and I cleaned up some code that Michael Shavlovsky and I had written for this:

https://github.com/lucadealfaro/authorship-tracking

The way to use it is super simple (see below). The attribution object can also be serialized and de-serialized to/from JSON (see documentation on GitHub).

The idea behind the code is to attribute the content to the earliest revision where the content was inserted, not the latest as diff tools usually do.  So if some piece of text is inserted, then deleted, then re-inserted (in a revert or a normal edit), we still attribute it to the earliest revision.  This is somewhat similar to what we tried to do in WikiTrust, but it's better done, and far more efficient.

The algorithm details can be found in http://www2013.wwwconference.org/proceedings/p343.pdf

I hope this might be of interest! 

Luca
import authorship_attribution

a = authorship_attribution.AuthorshipAttribution.new_attribution_processor(N=4)
# Add revisions oldest first; each one is a list of tokens plus a label.
a.add_revision("I like to eat pasta".split(), revision_info="rev0")
a.add_revision("I like to eat pasta with tomato sauce".split(), revision_info="rev1")
a.add_revision("I like to eat rice with tomato sauce".split(), revision_info="rev3")
# Each token of the latest revision is attributed to the earliest revision that inserted it.
print(a.get_attribution())
# Output:
# ['rev0', 'rev0', 'rev0', 'rev0', 'rev3', 'rev1', 'rev1', 'rev1']

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l