Dear All,
Michael Shavlovsky and I have been working on blame maps (authorship detection) for the various Wikipedias. We have code in the WikiMedia repository, written with the goal of obtaining a production system capable of attributing all content (not just a research demo). Here are some pointers:
- Code: https://gerrit.wikimedia.org/r/#/q/blamemaps,n,z
- Description of the blame maps MediaWiki extension: https://docs.google.com/document/d/15MEyu5tDZ3mhj_i1fDNFqNxWexK-B3BtbYKJlYEKdiQ/edit
- Detailed description of the underlying algorithm, with performance evaluation: https://www.soe.ucsc.edu/research/technical-reports/ucsc-soe-12-21/download
- Demo: http://blamemaps.wmflabs.org/mw/index.php/Main_Page
These are all also available from https://sites.google.com/a/ucsc.edu/luca/the-wikipedia-authorship-project

In brief, for each page we store metadata that summarizes the entire text evolution of the page; compressed, this metadata is about three times the size of a typical revision. Each time a new revision is made, we read the metadata, attribute every word of the revision, store the updated metadata, and store authorship data for the revision. The process takes 1-2 seconds, depending on the average revision size (most of the time is actually spent deserializing and reserializing the metadata). Comparing against all previous revisions handles content that is deleted and later re-inserted, as well as various attacks that might occur once authorship is displayed. I should also add that these algorithms are independent of the ones in WikiTrust, and should be much better.
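To make the flow concrete, here is a deliberately simplified sketch of the per-revision attribution step described above. The function name, the shape of the metadata, and the token-level matching are illustrative assumptions for discussion; the actual algorithm compares against the full summarized revision history and is far more sophisticated.

```python
# Hypothetical sketch of the per-revision attribution flow; names and
# data shapes are illustrative, not the actual extension's internals.

def attribute_revision(page_metadata, revision_tokens, author):
    """Attribute each token of a new revision against the page's
    summarized text-evolution metadata, and update that metadata."""
    attributed = []
    for token in revision_tokens:
        # A token seen in any earlier revision keeps its original
        # author, even if it was deleted and later re-inserted.
        prior_author = page_metadata.get(token)
        if prior_author is not None:
            attributed.append((token, prior_author))
        else:
            # Genuinely new content is credited to this revision's author.
            page_metadata[token] = author
            attributed.append((token, author))
    return attributed

# Usage: a second author re-inserts a word first written by the first.
meta = {}
attribute_revision(meta, ["alpha", "beta"], "alice")
result = attribute_revision(meta, ["alpha", "gamma"], "bob")
# "alpha" stays attributed to alice; "gamma" is new and goes to bob.
```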
We have NOT developed a GUI for this: our plan was simply to provide a data API that reports the authorship of each word. There are many ways to display this information, from page-level authorship summaries to detailed word-by-word views, and we thought that surely others would want to play with the visualization aspect.
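As a starting point for that API discussion, here is one possible shape for a word-level authorship response. The field names and structure are purely hypothetical suggestions, not what the extension currently produces:

```python
import json

# Hypothetical example of a word-level authorship API response; the
# field names and structure are assumptions for discussion only.
response = {
    "page": "Main_Page",
    "revision": 12345,
    "tokens": [
        {"word": "Hello", "author": "alice", "origin_revision": 100},
        {"word": "world", "author": "bob", "origin_revision": 412},
    ],
}
print(json.dumps(response, indent=2))
```

A GUI could consume such a payload directly, e.g. coloring each word by author or rolling the token list up into per-author percentages for a page summary.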
I am writing this message because we hope this might be of interest, and because we would be quite happy to find people willing to collaborate. Is anybody interested in developing a GUI for it and talking to us about what API we should provide for retrieving this authorship information? Is anybody interested in helping move the code to a production-ready stage?
I would also like to mention that Fabian Flöck has developed another very interesting algorithm for attributing content, reported in http://wikipedia-academy.de/2012/w/images/2/24/23_Paper_Fabian_Fl%C3%B6ck_An... Fabian and I are now starting to collaborate: we want to compare the algorithms and work together toward something we are happy with, and that can run in production.
Indeed, I think a reasonable first goal would be to:
- Define a data API
- Define some coarse requirements for the system
- Have a look at the above results / algorithms / implementation and advise us
I am sure that the algorithm details can be fine-tuned and changed to no end in a collaborative effort once the first version is up and running. The problem is putting together a bit of effort to get to that first running version.
Luca