Dear All,
Michael Shavlovsky and I have been working on blame maps (authorship
detection) for the various Wikipedias.
We have code in the WikiMedia repository written with the goal of
obtaining a production system capable of attributing all content (not
just a research demo). Here are some pointers:
- Code <https://gerrit.wikimedia.org/r/#/q/blamemaps,n,z>
- Description of the blame maps MediaWiki
extension<https://docs.google.com/document/d/15MEyu5tDZ3mhj_i1fDNFqNxWex…
- Detailed description of the underlying algorithm, with performance
evaluation<https://www.soe.ucsc.edu/research/technical-reports/ucsc-soe-…
- Demo <http://blamemaps.wmflabs.org/mw/index.php/Main_Page>
These are also all available from
https://sites.google.com/a/ucsc.edu/luca/the-wikipedia-authorship-project
In brief, for each page we store metadata that summarizes the entire text
evolution of the page; this metadata, compressed, is about three times the
size of a typical revision. Each time a new revision is made, we read this
metadata, attribute every word of the revision, store updated metadata, and
store authorship data for the revision. The process takes 1-2 seconds
depending on the average revision size (most of the time is actually
devoted to deserializing and reserializing the metadata). Comparing with
all previous revisions takes care of cases such as content that is deleted
and later re-inserted, and various other attacks that might occur
once authorship is displayed. I should also add that these algorithms are
independent of the ones in WikiTrust, and should be much better.
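To make the idea concrete, here is a deliberately minimal sketch (not the actual extension code; the function and data structure are illustrative assumptions) of why comparing against the whole page history keeps attribution stable. The metadata is reduced here to a map from each word ever seen on the page to its first author, standing in for the richer compressed summary described above.

```python
def attribute_revision(metadata, new_text, author):
    """Attribute every word of a new revision and update the page metadata.

    metadata: dict mapping each word ever seen on the page to the author
    who first introduced it (a toy stand-in for the compressed summary
    of the page's entire text evolution).
    Returns a list of (word, author) pairs for the new revision.
    """
    attribution = []
    for word in new_text.split():
        if word not in metadata:
            # Word never appeared in any prior revision: genuinely new content.
            metadata[word] = author
        attribution.append((word, metadata[word]))
    return attribution

# Example: "brown" is deleted in revision 2 and re-inserted in revision 3,
# but remains attributed to alice because the metadata remembers it.
meta = {}
attribute_revision(meta, "the quick brown fox", "alice")
attribute_revision(meta, "the quick fox", "bob")
rev3 = attribute_revision(meta, "the quick brown fox jumps", "carol")
```

A real implementation has to work on token sequences and handle moves, splits, and rare-word matching rather than single-word lookups, but the delete-then-reinsert case is handled by the same principle.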
We have NOT developed a GUI for this: our plan was just to provide a data
API that gives information on authorship of each word. There are many ways
to display the information, from page summaries of authorship to detailed
word-by-word information, and we thought that surely others would want to
play with the visualization aspect.
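As a starting point for that discussion, here is one purely hypothetical shape such a data API response could take; all field names here are assumptions, not the extension's actual output. A GUI could aggregate the word-by-word records into page-level summaries:

```python
import json

# Hypothetical per-revision authorship payload: one record per token,
# carrying the attributed author and the revision where the token originated.
authorship = {
    "page_id": 12345,
    "rev_id": 67890,
    "tokens": [
        {"word": "the",   "author": "alice", "origin_rev": 101},
        {"word": "quick", "author": "bob",   "origin_rev": 205},
    ],
}

def author_summary(payload):
    """Aggregate word-level authorship into per-author token counts."""
    counts = {}
    for tok in payload["tokens"]:
        counts[tok["author"]] = counts.get(tok["author"], 0) + 1
    return counts

print(json.dumps(author_summary(authorship)))
```

Whether the API should return such token lists, only aggregates, or both is exactly the kind of question we would like to settle with whoever builds the visualization.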
I am writing this message because we hope this might be of interest, and
because we would be quite happy to find people willing to collaborate. Is
anybody interested in developing a GUI for it and talking to us about what
API we should provide for retrieving this authorship information? Is anybody
interested in helping to move the code to a production-ready stage?
I would also like to mention that Fabian Floeck has developed another very
interesting algorithm for attributing content, described in
http://wikipedia-academy.de/2012/w/images/2/24/23_Paper_Fabian_Fl%C3%B6ck_A…
Fabian and I are now starting to collaborate: we want to compare the
algorithms, and work together to obtain something we are happy with, and
that can run in production.
Indeed, I think a reasonable first goal would be to:
- Define a data API
- Define some coarse requirements for the system
- Have a look at the above results / algorithms / implementations and
advise us.
I am sure the algorithm details can be fine-tuned and changed to no
end in a collaborative effort once the first version is up and running.
The challenge is putting together enough effort to get to that first
running version.
Luca