Hi, as Luca already mentioned, we (my colleagues Maribel Acosta and Felix Keppmann and I) are also working on an algorithm for authorship detection. Our approach differs somewhat from Luca and Michael's: we rebuild authorship information for words by hashing paragraphs and sentences with MD5 (i.e. we check whether they have existed before at any time in the article) and use a diff algorithm to detect the changes in those parts of the article that have not been seen before.
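To illustrate the idea, here is a rough sketch only, not the actual implementation. The names `AuthorshipSketch` and `md5_hex` are made up for this example, and the naive paragraph/sentence splitting and the use of Python's `difflib` for the diff step are assumptions:

```python
import difflib
import hashlib


def md5_hex(text):
    """MD5 hex digest of a paragraph or sentence (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class AuthorshipSketch:
    """Toy word-level authorship tracker; all names are illustrative."""

    def __init__(self):
        # MD5 of a paragraph or sentence -> [(word, origin_revision), ...]
        self.seen = {}
        # Attributed words of the previously processed revision.
        self.previous = []

    def add_revision(self, rev_id, text):
        attributed = []
        # Assumption: paragraphs are separated by blank lines.
        for paragraph in filter(None, (p.strip() for p in text.split("\n\n"))):
            p_key = md5_hex(paragraph)
            if p_key not in self.seen:
                words = []
                # Assumption: sentences are split naively on ". ".
                for sentence in filter(None, (s.strip() for s in paragraph.split(". "))):
                    s_key = md5_hex(sentence)
                    if s_key not in self.seen:
                        # Never seen before: fall back to a word-level diff.
                        self.seen[s_key] = self._diff(rev_id, sentence.split())
                    words.extend(self.seen[s_key])
                self.seen[p_key] = words
            # A known hash restores the stored origins, however old they are.
            attributed.extend(self.seen[p_key])
        self.previous = attributed
        return attributed

    def _diff(self, rev_id, words):
        """Words matched in the previous revision keep their origin;
        all other words are attributed to rev_id."""
        old_words = [w for w, _ in self.previous]
        matcher = difflib.SequenceMatcher(a=old_words, b=words, autojunk=False)
        origins = [rev_id] * len(words)
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                origins[block.b + k] = self.previous[block.a + k][1]
        return list(zip(words, origins))


tracker = AuthorshipSketch()
tracker.add_revision(1, "Alpha beta gamma.\n\nDelta epsilon zeta.")
result = tracker.add_revision(2, "Alpha beta gamma.\n\nDelta new epsilon zeta.")
# Only "new" is attributed to revision 2; every other word keeps origin 1.
```

Because the hash lookup covers all earlier revisions, a paragraph that is deleted and later restored gets its original attribution back without any diffing.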
We build on an older, more basic model of ours, described in the paper Luca already included in his mail [1]. Currently we are at 0.04 seconds per revision for the pure calculation, without writing the hashes to or reading them from a database. Adding that database step, to make the method incremental, is what we are working on now. We will make the code publicly available soon. We would like to contribute as much as we can to the Wikipedia authorship project with our solution and are open to any collaboration.
Another issue is, of course, the accuracy of the word attributions, which we will ask the community to help evaluate. We have set up a small gold standard of 184 words and their origin (who wrote them in which revision), which can be found here: [2]. The words were randomly selected and their origin determined manually. I invite everyone to look at this set, comment on whether the postulated revisions of origin seem right, and perhaps extend it. Although we will run an evaluation with a bigger user base, this serves as a useful starting point for preliminary testing. Right now we reach an accuracy of ~85% on this set (compared to ~50% for the old WikiTrust algorithm, see [1]), although there is still a lot of room for tuning in our algorithm.
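For anyone who wants to run their own method against such a gold standard, the accuracy figure is simply the share of gold-standard words whose predicted revision of origin matches the manually determined one. A minimal sketch follows; the function name, the dictionary layout and the toy values are assumptions for illustration and are not taken from the spreadsheet in [2]:

```python
def attribution_accuracy(predicted, gold):
    """Fraction of gold-standard words whose predicted origin revision
    equals the manually determined one (hypothetical helper)."""
    if not gold:
        raise ValueError("gold standard is empty")
    hits = sum(1 for key, rev in gold.items() if predicted.get(key) == rev)
    return hits / len(gold)


# Toy data: keys are made-up (article, word position) pairs,
# values are made-up revision ids.
gold = {("A", 3): 10, ("A", 7): 12, ("B", 1): 5, ("B", 9): 8}
predicted = {("A", 3): 10, ("A", 7): 11, ("B", 1): 5, ("B", 9): 8}
score = attribution_accuracy(predicted, gold)  # 3 of 4 origins match
```

Words missing from the predictions count as errors here, which seems the safer convention for a small set.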
Best,
Fabian
[1] http://wikipedia-academy.de/2012/w/images/2/24/23_Paper_Fabian_Fl%C3%B6ck_An...
[2] https://docs.google.com/spreadsheet/ccc?key=0An7RIRiLIXD5dENITFpmU0c1RVZaU1N...
-- Karlsruhe Institute of Technology (KIT) Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck Research Associate
Building 11.40, Room 222 KIT-Campus South D-76128 Karlsruhe
Phone: +49 721 608 4 6584 Fax: +49 721 608 4 6580 Skype: f.floeck_work E-Mail: fabian.floeck@kit.edu WWW: http://www.aifb.kit.edu/web/Fabian_Fl%C3%B6ck
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association