Dear All,
I was at OpenSym yesterday (many thanks to Dirk for organizing it!), and I was chatting with some people about the attribution of content to its authors in a wiki. So I got inspired, and I cleaned up some code that Michael Shavlovsky and I had written for this:
https://github.com/lucadealfaro/authorship-tracking
The way to use it is super simple (see below). The attribution object can also be serialized and de-serialized to/from JSON (see the documentation on GitHub).
The idea behind the code is to attribute the content to the *earliest revision* where the content was inserted, not the latest, as diff tools usually do. So if some piece of text is inserted, then deleted, then re-inserted (in a revert or a normal edit), we still attribute it to the earliest revision. This is somewhat similar to what we tried to do in WikiTrust, but it's better done, and far more efficient.
The algorithm details can be found in http://www2013.wwwconference.org/proceedings/p343.pdf
I hope this might be of interest!
Luca
import authorship_attribution
a = authorship_attribution.AuthorshipAttribution.new_attribution_processor(N=4)
a.add_revision("I like to eat pasta".split(), revision_info="rev0")
a.add_revision("I like to eat pasta with tomato sauce".split(), revision_info="rev1")
a.add_revision("I like to eat rice with tomato sauce".split(), revision_info="rev3")
print a.get_attribution()
['rev0', 'rev0', 'rev0', 'rev0', 'rev3', 'rev1', 'rev1', 'rev1']
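For example (a small sketch using the same calls as above; the output shown is what the earliest-attribution behavior just described should produce), text that is deleted and later re-inserted keeps its original attribution:

import authorship_attribution

# Text is inserted in rev0, deleted in rev1, and re-inserted in rev2.
a = authorship_attribution.AuthorshipAttribution.new_attribution_processor(N=4)
a.add_revision("I like to eat pasta".split(), revision_info="rev0")
a.add_revision("I like rice".split(), revision_info="rev1")
a.add_revision("I like to eat pasta".split(), revision_info="rev2")
print a.get_attribution()
# Expected: ['rev0', 'rev0', 'rev0', 'rev0', 'rev0'] -- not 'rev2'.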
Hey Luca!
Welcome back to the content persistence tracking club!
I feel like I should clear up some misconceptions. First, yours is not the first Python library useful for determining the authorship of content in versioned text, and I don't think you have given fair treatment to the work we have been doing since you last worked on WikiTrust. For example, it's hard to tell from your description whether you are doing anything different from the wikiwho api[2] in tracking content historically. Further, the work I have been doing with diff-based content persistence (e.g. [1]) is not so naive as to miss removals and re-additions under most circumstances.
In my opinion, this is much better for measuring the productivity of a contribution (adding content that looks like content that was removed long ago is still productive, isn't it?), but maybe less useful for attributing first-contributor status to a particular sub-statement. Regardless, it seems that a qualitative analysis is necessary to determine whether these differences matter and whether one strategy is better than the other. AFAICT, the only software that has received this kind of analysis is wikiwho (discussed in [3]).
Regardless, it's great to have you working in this space again, and I welcome you to help us develop an overview of content persistence measurement strategies that is complete and allows others to critically decide which strategy matches their needs. See https://meta.wikimedia.org/wiki/Research:Content_persistence for such an overview. I encourage you to use this description of persistence measures to differentiate your strategy from the work we have been doing over the last 5 years. Edit boldly!
1. https://pythonhosted.org/mediawiki-utilities/lib/persistence.html#mw-lib-per...
2. http://people.aifb.kit.edu/ffl/wikiwho/
3. http://people.aifb.kit.edu/ffl/wikiwho/fp715-floeck.pdf
-Aaron
Dear Aaron,
Sorry, sorry! Thanks for helping clear up some misconceptions, and let me see if I can do more.
The WikiWho API is very nice work, and was presented in WWW 2014.
The work of Michael and myself dates from one year before, WWW 2013 (see http://www2013.wwwconference.org/proceedings/p343.pdf).
This is why in our work we don't give credit to WikiWho. In fact, it is they who most politely cite us.
Now, you ask, why don't I give them more credit now? Because I haven't really done anything new. I am not claiming anything new; I have just taken code that was written in 2013 and made it more easily available on GitHub, with a moderate clean-up of its API. We tried to make that code available to the community in 2013 by putting it into Gerrit (we were told it was the proper place), but it didn't really work out. Again, I am not pushing a new result out. I am simply making available code that dates from some time back, and that I realized yesterday might be useful to others.
There are many, many ways to attribute content. Even if you go for the theory of "earliest possible attribution", which is what we do in the paper and code, it would certainly be better done using language models of average text, to better distinguish casual from intentional repetition.
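To make that concrete, here is a toy sketch of such a gate (the function, frequency table, and threshold here are made up for illustration; none of this is in the library): an n-gram that is common in average text is treated as casual repetition, while a rare one is treated as an intentional re-insertion.

# Hypothetical sketch: gate re-attribution on n-gram rarity.
def is_intentional_repeat(ngram, corpus_freq, threshold=1e-6):
    # corpus_freq maps n-grams to their frequency in "average" text;
    # anything rarer than the threshold is unlikely to recur by chance.
    return corpus_freq.get(ngram, 0.0) < threshold

freqs = {("on", "the", "other", "hand"): 3e-4,
         ("earliest", "plausible", "attribution", "algorithm"): 1e-9}
print is_intentional_repeat(("on", "the", "other", "hand"), freqs)
# False: a common phrase, so treat the match as coincidental.
print is_intentional_repeat(("earliest", "plausible", "attribution", "algorithm"), freqs)
# True: rare enough to count as a genuine re-insertion.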
I put the code on GitHub because I was inspired by our conversation yesterday. If you like, I'd be happy to give you access to the repo (write access, I mean), so you can both do the code reviews we had been mentioning and improve the README.md file with more considerations and references. Let me know.
Again, what I wanted to do is make code written 2-3 years ago more easily available, not make any new claims.
Luca
Luca de Alfaro, 22/08/2015 01:51:
So I got inspired, and I cleaned up some code that Michael Shavlovsky and I had written for this:
Great! It's always good when the code behind a paper is published; it's never too late. If you can, please add a link from WikiPapers: http://wikipapers.referata.com/wiki/Form:Tool
Nemo
Thank you, Federico. Done.
BTW, we also had code for a MediaWiki extension that computed this in real time. That code has not yet been cleaned up, but it is available here: https://sites.google.com/a/ucsc.edu/luca/the-wikipedia-authorship-project If there is interest, I don't think it would be hard to clean it up and post it properly to GitHub. The extension uses the edit hook to attribute the content of every new revision of a wiki page, using the "earliest plausible attribution" idea & algorithm we used in the paper.
Luca
Sorry, I meant to say: if there is interest in the code for the MediaWiki extension, let me know, and _we_ will clean it up and put it on GitHub (you won't have to clean it up :-). Luca
Luca,
No worries. Glad to have your code out there. In a lot of ways, this mailing list is a public record, so I wanted to make sure there was a good summary of the state to accompany your announcement. I meant it when I said that I'm glad you are working in this space and I look forward to working with you. :)
-Aaron
Hi guys,
I just saw this thread. Great to see your interest in the topic revived, Luca! Thanks to Aaron for pointing out the related work :)
Some comments:
> For example, it's hard to tell from your description whether you are doing anything different from the wikiwho api[2] in tracking content historically.
The technique that Luca and Michael’s algorithm (let’s call it A3) uses is quite different from wikiwho and has other tuning parameters. While A3 is built on the idea of identifying identical tokens via a “rarity function” (in the original paper it was a 4-token sequence with the same neighbors, if I recall correctly), wikiwho uses a hierarchical splitting into paragraphs, sentences and then tokens (with diffing in the last stage). wikiwho could still be refined by testing different splittings of the text instead of paragraphs and sentences (and a different differ, cf. e.g. Aaron’s work), while A3 depends much on the defined rarity function to decide if a token is “the same”, which has not yet been explored to its full potential, afaik. (Correct me if I’m wrong, Luca.)
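As a toy illustration of the hierarchical idea (this is not the actual wikiwho code; the function and data are made up for the example), identical paragraphs can inherit their previous attribution wholesale, with only changed material falling through to finer-grained matching:

def attribute_paragraphs(prev_paras, prev_attr, new_paras, rev_id):
    # Unchanged paragraphs keep the attribution they had in the previous
    # revision; everything else is credited to the new revision. The real
    # algorithm would recurse into sentences and then token-level diffs
    # before giving up and attributing to rev_id.
    prev_index = dict(zip(prev_paras, prev_attr))
    return [prev_index.get(p, rev_id) for p in new_paras]

r0 = ["I like pasta.", "Tomato sauce is good."]
r1 = ["I like pasta.", "Rice is better."]
print attribute_paragraphs(r0, ["rev0", "rev0"], r1, "rev1")
# ['rev0', 'rev1']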
> Further, the work I have been doing with diff-based content persistence (e.g. [1]) is not so naive as to miss removals and re-additions under most circumstances.
FWIW, wikiwho also tracks exactly when a token appeared, disappeared and reappeared, including whether it was a reintroduction, a repeated delete, etc. We also added the calculation of relationships between revisions (and, in aggregation, editors), which is the data used in the whoVIS visualization [1]. It’s all available in WikiwhoRelationships.py at [2]. The API, however, so far only delivers information about provenance (first appearance and authors), but in time we will add some parameters to receive that information as well.
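One way such per-token histories might be represented (illustrative only; this is not wikiwho’s actual data model, and the class is made up for the example):

class TokenHistory(object):
    # Records a token's full in/out history rather than only its first
    # appearance, so reintroductions and repeated deletes stay visible.
    def __init__(self, token, origin_rev):
        self.token = token
        self.origin_rev = origin_rev        # revision that introduced it
        self.events = [("in", origin_rev)]  # alternating in/out events

    def observe(self, present, rev_id):
        last_state = self.events[-1][0]
        if present and last_state == "out":
            self.events.append(("in", rev_id))   # reintroduction
        elif not present and last_state == "in":
            self.events.append(("out", rev_id))  # deletion

h = TokenHistory("pasta", "rev0")
h.observe(False, "rev1")
h.observe(True, "rev2")
print h.events
# [('in', 'rev0'), ('out', 'rev1'), ('in', 'rev2')]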
In my opinion, this is much better for measuring the productivity of a contribution (adding content that looks like content that was removed long ago is still productive, isn't it?),
Good points. One question we always had: how much time has to pass before we consider something a “copy” of someone else’s contribution versus that editor's new, own contribution (and whether it is only the contribution of “re-discovering” good content)? I.e., if the original text was absent for 2 years, is the re-introduction of good text more productive than just doing a revert of vandalism after 1 revision? In the current wikiwho implementation, it’s always attributed to the first author, like you said.
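One hypothetical way to operationalize that question (neither wikiwho nor A3 does this; the function and numbers are invented for illustration) would be to let re-introduction credit depend on how long the text was absent:

def reintroduction_credit(revisions_absent, half_life=100.0):
    # Credit decays with the length of the absence: re-inserting text one
    # revision after it was removed is essentially a revert, while text
    # absent for hundreds of revisions earns less (or, under the opposite
    # view, the weighting could just as well be turned upside down).
    return 0.5 ** (revisions_absent / half_life)

print reintroduction_credit(1)    # ~0.99: essentially a revert
print reintroduction_credit(500)  # ~0.03: mostly "re-discovery"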
Also, what about productive deletes? I’m curious if/how you measure those, Aaron.
> Regardless, it seems that a qualitative analysis is necessary to determine whether these differences matter and whether one strategy is better than the other. AFAICT, the only software that has received this kind of analysis is wikiwho (discussed in [3]).
I strongly agree that more qualitative analysis of the algorithm outputs is necessary, as the problem is not that trivial in all cases (as can be seen from our results in [3], where we compared wikiwho with one instantiation of A3). I’m also not aware of any other evaluation than the one we did in the wikiwho paper. But with Wiki Labels (as far as I understand), we now have a great tool to do more human assessment of provenance and content persistence.
Anyhow, great to have some discourse about the topic here :)
Gruß, Fabian
[1] http://f-squared.org/whovisual
[2] https://github.com/maribelacosta/wikiwho
[3] http://people.aifb.kit.edu/ffl/wikiwho/fp715-floeck.pdf
--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: +49 (0) 221-47694-208
fabian.floeck@gesis.org
www.gesis.org www.facebook.com/gesis.org