Hi,
As Luca already mentioned, we (my colleagues Maribel Acosta and Felix Keppmann and I) are
also working on an algorithm for authorship detection. Our approach differs somewhat
from Luca and Michael's in that we rebuild authorship information for words in
paragraphs and sentences via MD5 hashes (i.e., we check whether they have existed before
at any time in the article) and use a diff algorithm to detect the changes in the parts
of the article that have not been seen before.
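
As a rough illustration of the idea (a simplified Python sketch, not the actual
implementation), paragraphs and sentences can be looked up by their MD5 hash, and only
text that has never been seen is diffed word by word against the previous revision. All
function names, the naive sentence splitter, and the use of difflib are assumptions made
for this example only.

```python
import hashlib
import re
from difflib import SequenceMatcher


def md5(text):
    """Hex MD5 digest of a text unit (paragraph or sentence)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def split_paragraphs(text):
    # Assumption: paragraphs are separated by blank lines.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def split_sentences(paragraph):
    # Assumption: a naive splitter on ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def diff_words(words, prev, rev_id):
    """Label the words of an unseen sentence.

    Words that the diff matches in the previous revision keep their old
    origin; all other words are attributed to the current revision.
    """
    prev_words = [w for w, _ in prev]
    labelled = [(w, rev_id) for w in words]
    matcher = SequenceMatcher(a=prev_words, b=words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            labelled[block.b + k] = prev[block.a + k]
    return labelled


def attribute(revisions):
    """revisions: list of (rev_id, text) pairs in chronological order.

    Returns a list of (word, origin_rev_id) for the last revision.
    """
    seen = {}  # MD5 hash -> labelled words of a paragraph or sentence
    prev = []  # labelled words of the previous revision
    for rev_id, text in revisions:
        current = []
        for para in split_paragraphs(text):
            para_key = md5(para)
            if para_key in seen:  # whole paragraph existed before
                current.extend(seen[para_key])
                continue
            start = len(current)
            for sent in split_sentences(para):
                sent_key = md5(sent)
                if sent_key not in seen:  # unseen sentence: fall back to diff
                    seen[sent_key] = diff_words(sent.split(), prev, rev_id)
                current.extend(seen[sent_key])
            seen[para_key] = current[start:]
        prev = current
    return prev
```

For example, attribute([(1, "The cat sat. The dog ran."), (2, "The cat sat. The big dog
ran.")]) attributes only "big" to revision 2. In this sketch the hashes are kept in an
in-memory dictionary; an incremental version would persist them instead. Diffing against
the whole previous revision is a simplification.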
We build on an older, more basic model of ours, described in the paper Luca already
included in his mail [1]. Currently we are at 0.04 seconds per revision for the pure
calculation, without writing/reading the hashes to/from a database. That is the step we
are working on now, in order to make the method incremental. We will make the code
publicly available soon. We would like to contribute as much as we can to the Wikipedia
authorship project with our solution and are open to any collaboration.
Another issue is, of course, the accuracy of the origins attributed to words, which we
will ask the community to help evaluate. We have set up a small gold standard set of 184
words and their origin (who wrote them in which revision), which can be found here: [2].
The words were randomly selected and their origin determined manually. I invite everyone
to look at this set, comment on whether the postulated revisions of origin seem right,
and perhaps extend it. Although we will run an evaluation with a bigger user base, this
serves as a useful starting point for preliminary testing. Right now we reach an accuracy
of ~85% with this set (compared to ~50% for the old WikiTrust algorithm, see [1]),
although there are still many tuning possibilities in our algorithm.
Best,
Fabian
[1]
http://wikipedia-academy.de/2012/w/images/2/24/23_Paper_Fabian_Fl%C3%B6ck_A…
[2]
https://docs.google.com/spreadsheet/ccc?key=0An7RIRiLIXD5dENITFpmU0c1RVZaU1…
--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck
Research Associate
Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe
Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: fabian.floeck@kit.edu
WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association