Re: [Wiki-research-l] Tracking authorship of wiki content

22 Aug 2015

Dear Aaron,

sorry, sorry, thanks for helping clear up some mis-conceptions, and let me
see if I can do more.

The WikiWho API is very nice work, and was presented in WWW 2014.

The work of Michael and myself dates from one year before, WWW 2013 (see
http://www2013.wwwconference.org/proceedings/p343.pdf).

This is why in our work we don't give credit to WikiWho.  In fact, it is
them who most politely cite us.

Now, you say, why don't I give them more credit now?  Because I haven't
really done anything new.  I am not claiming anything new, I have just
taken code that was written in 2013, and made it better available on
github, with a moderate clean-up of its API.  We tried to make that code
available in 2013 to the community, by putting it into gerritt (we were
told it was the proper place), but it didn't really work out.  Again, I am
not pushing a new result out.  I am simply making code available that dates
from some time back, and that I realized yesterday might be useful to
others.

There are many many ways to attribute content.  Even if you go for the
theory of "earliest possible attribution", which is what we do in the paper
and code, it would certainly be better done using language models of
average text, to better distinguish casual from intentional repetition.

I put the code on github because I was inspired by our conversation
yesterday.  If you like, I'd be happy to give you access to the repo (write
access I mean) so you can both do the code reviews we had been mentioning,
and improve the README.md file with more considerations and references.
Let me know.

Again, what I wanted to do is make better available code written 2-3 years
ago, not really make any new claims.

Luca

On Fri, Aug 21, 2015 at 5:49 PM, Aaron Halfaker &lt;aaron.halfaker(a)gmail.com&gt;
wrote:

...
  Hey Luca!

 Welcome back to the content persistence tracking club!

 I feel like I should clear up some misconceptions.  1st, yours is not the
 first python library that is useful for determining the authorship of
 content in versioned text and I don't think you have given fair treatment
 to the work we have been doing since you last worked on WikiTrust.  For
 example, its hard to tell from your description whether you are doing
 anything different than the wikiwho api[2] with tracking content
 historically.  Further the work I have been doing with diff-based content
 persistence (e.g. [1]) is not so simple as to not notice removals and
 re-additions under most circumstances.

 In my opinion, this is much better for measuring the productivity of a
 contribution (adding content that looks like content that was removed long
 ago is still productive, isn't it?), but maybe less useful for attributing
 a first contributor status to a particular sub-statement.  Regardless, it
 seems that a qualitative analysis is necessary to determine whether these
 differences matter and whether one strategy is better than the other.
 AFAICT, the only software that has received this kind of analysis is
 wikiwho (discussed in [3]).

 Regardless, it's great to have you working in this space again and I
 welcome you to help us develop overview of content persistence measurement
 strategies that is complete and allows others to critically decide which
 strategy matches their needs.   See
 https://meta.wikimedia.org/wiki/Research:Content_persistence for such an
 overview.  I encourage you to use this description of persistence measures
 to differentiate your strategy from the work we have been doing over the
 last 5 years.  Edit boldly!

 1.
 https://pythonhosted.org/mediawiki-utilities/lib/persistence.html#mw-lib-pe…
 2. http://people.aifb.kit.edu/ffl/wikiwho/
 3. http://people.aifb.kit.edu/ffl/wikiwho/fp715-floeck.pdf

 -Aaron

 On Aug 21, 2015 4:52 PM, "Luca de Alfaro" &lt;luca(a)dealfaro.com&gt; wrote:

  Dear All,

 I was yesterday at OpenSym (many thanks to Dirk for organizing this!),
 and I was chatting with some people about attribution of content to its
 authors in a wiki.
 So I got inspired, and I cleaned up some code that Michael Shavlovsky and
 I had written for this:

 https://github.com/lucadealfaro/authorship-tracking

 The way to use it is super simple (see below).  The attribution object
 can also be serialized and de-serialized to/from json (see documentation on
 github).

 The idea behind the code is to attribute the content to the *earliest
 revision *where the content was inserted, not the latest as diff tools
 usually do.  So if some piece of text is inserted, then deleted, then
 re-inserted (in a revert or a normal edit), we still attribute it to the
 earliest revision.  This is somewhat similar to what we tried to do in
 WikiTrust, but it's better done, and far more efficient.

 The algorithm details can be found in
 http://www2013.wwwconference.org/proceedings/p343.pdf

 I hope this might be of interest!

 Luca

 import authorship_attribution

 a = authorship_attribution.AuthorshipAttribution.new_attribution_processor(N=4)
 a.add_revision("I like to eat pasta".split(), revision_info="rev0")
 a.add_revision("I like to eat pasta with tomato sauce".split(),
revision_info="rev1")
 a.add_revision("I like to eat rice with tomato sauce".split(),
revision_info="rev3")print a.get_attribution()

 ['rev0', 'rev0', 'rev0', 'rev0', 'rev3',
'rev1', 'rev1', 'rev1']

 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

  _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

Re: [Wiki-research-l] Tracking authorship of wiki content