Hi guys,
I just saw this thread. Great to see your interest in the topic revived, Luca!
Thanks to Aaron for pointing out the related work :)
Some comments:
For example, it's hard to tell from your description whether you are doing anything
different than the wikiwho api[2] with tracking content historically.
The technique that Luca and Michael’s algorithm (let’s call it A3) uses is quite
different from wikiwho and has other tuning parameters. While A3 is built on the idea of
identifying identical tokens via a “rarity function” (in the original paper it was
a 4-token-sequence of the same neighbors, if I recall correctly), wikiwho uses a
hierarchical splitting into paragraphs, sentences and then tokens (with diffing in the
last stage). wikiwho could still be refined by testing different splittings of the text
instead of paragraphs and sentences (and a different differ, cf. e.g. Aaron’s work), while
A3 depends heavily on the defined rarity function to decide if a token is “the same”, which
has not been explored to its full potential yet afaik. (correct me if I’m wrong, Luca)
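To make the hierarchical idea a bit more tangible, here is a rough Python sketch of
"match whole paragraphs first, then whole sentences, then diff the remaining tokens".
It is not the wikiwho code: the function names, the regex-based splitting and the use
of difflib are all assumptions made for illustration, and the last stage diffs against
all old tokens rather than only the unmatched ones.

```python
import difflib
import re


def split_paragraphs(text):
    # Assumption: paragraphs are separated by blank lines.
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    # Assumption: a naive split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def diff_tokens(old_pairs, new_tokens, rev_id):
    # Last stage: tokens in matching blocks keep their old origin,
    # everything else is attributed to the current revision.
    old_tokens = [tok for tok, _ in old_pairs]
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    out = [(tok, rev_id) for tok in new_tokens]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            out[block.b + k] = old_pairs[block.a + k]
    return out


def attribute(prev_paragraphs, text, rev_id):
    """prev_paragraphs: list of paragraphs, each a list of sentences,
    each a list of (token, origin_rev) pairs. Returns the same structure
    for the new revision text."""
    old_par = {tuple(tuple(t for t, _ in s) for s in p): p for p in prev_paragraphs}
    old_sent = {tuple(t for t, _ in s): s for p in prev_paragraphs for s in p}
    old_pairs = [pair for p in prev_paragraphs for s in p for pair in s]

    result = []
    for par in split_paragraphs(text):
        sentences = [tokenize(s) for s in split_sentences(par)]
        par_key = tuple(tuple(s) for s in sentences)
        if par_key in old_par:  # unchanged paragraph: reuse as a whole
            result.append(old_par[par_key])
            continue
        new_par = []
        for tokens in sentences:
            if tuple(tokens) in old_sent:  # unchanged sentence: reuse
                new_par.append(old_sent[tuple(tokens)])
            else:  # changed sentence: fall back to token diffing
                new_par.append(diff_tokens(old_pairs, tokens, rev_id))
        result.append(new_par)
    return result


rev1 = attribute([], "The cat sat. It was happy.", rev_id=1)
rev2 = attribute(rev1, "The cat sat. It was very happy.\n\nNew paragraph here.", rev_id=2)
```

Swapping split_paragraphs/split_sentences for other splitters, or diff_tokens for another
differ, is where the refinements mentioned above would plug in.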
Further the work I have been doing with diff-based content persistence (e.g. [1]) is not
so simple as to not notice removals and re-additions under most circumstances.
FWIW wikiwho also tracks exactly when a token appeared, disappeared and reappeared,
including if it was a reintroduction, repeated delete, etc. We also added the calculation
of relationships between revisions (and in aggregation: editors), which is the data used
in the whoVIS visualization [1]. It’s all available in WikiwhoRelationships.py at [2].
The API, however, so far only delivers information about provenance (first appearance and
authors), but in time we will add some parameters to receive that information as well.
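As a rough illustration of what such per-token bookkeeping could look like, here is a
minimal Python sketch. It is not taken from WikiwhoRelationships.py: the class, the field
names, the event labels and the choice of which revision an action "targets" are all
assumptions, and tokens are matched by plain string value to keep it short.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TokenHistory:
    value: str
    origin_rev: int  # revision in which the token first appeared
    inbound: list = field(default_factory=list)   # revisions that reintroduced it
    outbound: list = field(default_factory=list)  # revisions that removed it

    def present(self):
        # Present if every removal has been followed by a reintroduction.
        return len(self.inbound) == len(self.outbound)

    def update(self, rev_id, in_revision):
        """Record the effect of rev_id; return (kind, target_rev) or None."""
        if self.present() and not in_revision:
            kind = "repeated delete" if self.outbound else "delete"
            # Assumption: a delete targets whoever last put the token in.
            target = self.inbound[-1] if self.inbound else self.origin_rev
            self.outbound.append(rev_id)
            return kind, target
        if not self.present() and in_revision:
            # Assumption: a reintroduction targets the revision that deleted it.
            target = self.outbound[-1]
            self.inbound.append(rev_id)
            return "reintroduction", target
        return None


def relationships(histories, rev_id, tokens_in_revision):
    """Count (acting_rev, target_rev, kind) edges caused by one revision."""
    edges = Counter()
    for history in histories:
        event = history.update(rev_id, history.value in tokens_in_revision)
        if event is not None:
            kind, target = event
            edges[(rev_id, target, kind)] += 1
    return edges


histories = [TokenHistory("foo", 1), TokenHistory("bar", 1)]
edges_rev2 = relationships(histories, 2, {"bar"})         # rev 2 removes "foo"
edges_rev3 = relationships(histories, 3, {"foo", "bar"})  # rev 3 restores "foo"
edges_rev4 = relationships(histories, 4, set())           # rev 4 removes both
```

Summing such edges per pair of editors instead of per pair of revisions would give the
aggregated, editor-level view.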
In my opinion, this is much better for measuring the productivity of a contribution
(adding content that looks like content that was removed long ago is still productive,
isn't it?),
Good points. One question we always had: How much time has to pass to consider something a
“copy” of someone else’s contribution versus that editor's new, own contribution (even
if it is only the contribution of “re-discovering” good content)? I.e., if the
original text was absent for 2 years, is the re-introduction of good text more productive
than just doing a revert of vandalism after 1 revision? In the current wikiwho
implementation, it’s always attributed to the first author, like you said.
Also, what about productive deletes? I’m curious if/how you measure those, Aaron.
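To make the threshold question concrete, here is a tiny Python sketch of a policy with one
tunable parameter. The function and parameter names are hypothetical; only the default
(always credit the first author) mirrors the behaviour described above, and what a
sensible threshold would be is exactly the open question.

```python
from datetime import datetime, timedelta


def credit_for_reintroduction(first_author, reintroducer, removed_at, restored_at,
                              max_absence=None):
    """Decide who gets credit for text that was removed and later restored.

    max_absence=None always credits the first author. Otherwise the
    reintroducing editor is credited once the text was gone for longer
    than max_absence (a hypothetical tuning parameter).
    """
    if max_absence is None:
        return first_author
    if restored_at - removed_at > max_absence:
        return reintroducer
    return first_author


removed = datetime(2013, 8, 1)
restored = datetime(2015, 8, 1)
always_first = credit_for_reintroduction("Alice", "Bob", removed, restored)
after_a_year = credit_for_reintroduction("Alice", "Bob", removed, restored,
                                         max_absence=timedelta(days=365))
```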
Regardless, it seems that a qualitative analysis is necessary to determine whether these
differences matter and whether one strategy is better than the other. AFAICT, the only
software that has received this kind of analysis is wikiwho (discussed in [3]).
I strongly agree that more qualitative analysis of the algorithm outputs is necessary, as
the problem is not that trivial in all cases (as can be seen from our results in [3],
where we compared wikiwho with one instantiation of A3). I’m also not aware of any other
evaluation than the one we did in the wikiwho paper. But with Wiki Labels (as far as I
understand), we now have a great tool to do more human assessment of provenance and
content persistence.
Anyhow, great to have some discourse about the topic here :)
Gruß,
Fabian
[1]
http://f<http://people.aifb.kit.edu/ffl/wikiwho/fp715-floeck.pdf>-squ…
[2]
https://github.com/maribelacosta/wikiwho
[3]
http://people.aifb.kit.edu/ffl/wikiwho/fp715-floeck.pdf
--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: + 49 (0) 221-47694-208
fabian.floeck@gesis.org
www.gesis.org
www.facebook.com/gesis.org
On 22.08.2015, at 17:01, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
Luca,
No worries. Glad to have your code out there. In a lot of ways, this mailing list is a
public record, so I wanted to make sure there was a good summary of the state to accompany
your announcement. I meant it when I said that I'm glad you are working in this space
and I look forward to working with you. :)
-Aaron
On Sat, Aug 22, 2015 at 7:26 AM, Luca de Alfaro <luca@dealfaro.com> wrote:
Sorry, I meant to say: if there is interest in the code for the Mediawiki extension, let
me know, and _we_ will clean it up and put on github (you won't have to clean it up
:-).
Luca
On Sat, Aug 22, 2015 at 7:25 AM, Luca de Alfaro <luca@dealfaro.com> wrote:
Thank you Federico. Done.
BTW, we also had code for a Mediawiki extension that computed this in real time. That
code has not yet been cleaned up, but it is available from here:
https://sites.google.com/a/ucsc.edu/luca/the-wikipedia-authorship-project
If there is interest, I don't think it would be hard to clean it up and post it properly
to github.
The extension uses the edit hook to attribute the content of every new revision of a wiki
page, using the "earliest plausible attribution" idea & algo we used in the
paper.
Luca
On Sat, Aug 22, 2015 at 12:20 AM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
Luca de Alfaro, 22/08/2015 01:51:
So I got inspired, and I cleaned up some code that Michael Shavlovsky
and I had written for this:
https://github.com/lucadealfaro/authorship-tracking
Great! It's always good when the code behind a paper is published; it's never too
late.
If you can, please add a link from wikipapers:
http://wikipapers.referata.com/wiki/Form:Tool
Nemo
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l