Dear All,
Michael Shavlovsky and I have been working on blame maps (authorship detection) for the various Wikipedias. We have code in the Wikimedia repository, written with the goal of building a production system capable of attributing all content (not just a research demo). Here are some pointers:
- Code: https://gerrit.wikimedia.org/r/#/q/blamemaps,n,z
- Description of the blame maps MediaWiki extension: https://docs.google.com/document/d/15MEyu5tDZ3mhj_i1fDNFqNxWexK-B3BtbYKJlYEKdiQ/edit
- Detailed description of the underlying algorithm, with performance evaluation: https://www.soe.ucsc.edu/research/technical-reports/ucsc-soe-12-21/download
- Demo: http://blamemaps.wmflabs.org/mw/index.php/Main_Page
These are also all available from https://sites.google.com/a/ucsc.edu/luca/the-wikipedia-authorship-project

In brief, for each page we store metadata that summarizes the entire text evolution of the page; compressed, this metadata is about three times the size of a typical revision. Each time a new revision is made, we read this metadata, attribute every word of the revision, store the updated metadata, and store authorship data for the revision. The process takes 1-2 seconds per revision, depending on the average revision size (most of the time is actually spent deserializing and reserializing the metadata). Comparing with all previous revisions takes care of things like content that is deleted and then later re-inserted, as well as various attacks that might appear once authorship is displayed. I should also add that these algorithms are independent of the ones in WikiTrust, and should be much better.
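To make the flow concrete, here is a deliberately toy sketch of the per-revision step. The real algorithm (described in the tech report above) matches words in context rather than as a bag of tokens, and all names here are made up for illustration:

    def attribute_revision(meta, rev_id, author, text):
        """Toy attribution: each token keeps the (rev, author) of its earliest
        appearance anywhere in the page's history, so text that is deleted
        and later re-inserted is still credited to its first author."""
        labels = []
        for tok in text.split():
            if tok not in meta:              # token never seen in any prior revision
                meta[tok] = (rev_id, author)
            labels.append((tok,) + meta[tok])
        return labels

    meta = {}  # per-page metadata, persisted (compressed) between revisions
    attribute_revision(meta, 1, "alice", "the quick fox")
    print(attribute_revision(meta, 2, "bob", "the quick brown fox"))
    # 'brown' is credited to bob; everything else stays with alice (rev 1)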
We have NOT developed a GUI for this: our plan was just to provide a data API that gives information on the authorship of each word. There are many ways to display the information, from page-level summaries of authorship to detailed word-by-word views, and we thought that surely others would want to play with the visualization aspect.
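For instance, a per-word response might look like the following (purely illustrative; the shape of the API is exactly what we would like input on):

    example_response = {
        "page_id": 12345,           # hypothetical identifiers
        "rev_id": 987654321,
        "tokens": [
            {"text": "the",   "origin_rev": 1001, "author": "Alice"},
            {"text": "quick", "origin_rev": 2002, "author": "Bob"},
        ],
    }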
I am writing this message as we hope this might be of interest, and as we would be quite happy to find people willing to collaborate. Is anybody interested in developing a GUI for it and talking to us about what API we should provide for retrieving this authorship information? Is there anybody interested in helping move the code to a production-ready stage?
I also would like to mention that Fabian Flöck has developed another very interesting algorithm for attributing content, reported in http://wikipedia-academy.de/2012/w/images/2/24/23_Paper_Fabian_Fl%C3%B6ck_An... Fabian and I are now starting to collaborate: we want to compare the algorithms and work together toward something we are happy with and that can run in production.
Indeed, I think reasonable first goals would be to:
- Define a data API
- Define some coarse requirements for the system
- Have a look at the above results / algorithms / implementations and advise us
I am sure that the algorithm details can be fine-tuned and changed to no end in a collaborative effort once the first version is up and running. The problem is putting together a bit of effort to get to that first running version.
Luca
On 02/25/2013 09:21 PM, Luca de Alfaro wrote:
I am writing this message as we hope this might be of interest, and as we would be quite happy to find people willing to collaborate. Is anybody interested in developing a GUI for it and talking to us about what API we should provide for retrieving this authorship information? Is there anybody interested in helping move the code to a production-ready stage?
Are you planning to run this live in production (i.e. 1-2 seconds on every save)?
I think people would be reluctant to slow writes down further. You could potentially do it deferred, or in the job queue, but I think it might make more sense on something like Wikimedia Labs (https://www.mediawiki.org/wiki/Wikimedia_Labs).
Did you try doing it with no caching (similar to git blame, though I know it's a different algorithm)? I'm wondering how much benefit you get from the cached info.
Matt Flaschen
I agree: in fact we don't do it in the write pipeline. The code we wrote implements a simple queue, where page_ids are queued for processing. The processing job then gets a page_id out of that table and processes all the missing revisions for that page_id. This is also useful if (say) there is a page merge or something similar: we can just erase all authorship information for that page, and at the next edit it will be rebuilt.
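In rough pseudocode, the worker loop looks like this (table and helper names are made up; missing_revisions() and attribute_and_store() stand for the pieces described above):

    import time

    def worker(db):
        while True:
            row = db.execute("SELECT page_id FROM blame_queue LIMIT 1").fetchone()
            if row is None:
                time.sleep(5)                # nothing queued; poll again later
                continue
            page_id = row[0]
            for rev in missing_revisions(page_id):   # revisions lacking authorship data
                attribute_and_store(page_id, rev)    # the 1-2 s step described earlier
            db.execute("DELETE FROM blame_queue WHERE page_id = ?", (page_id,))
            db.commit()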
What we wrote can also work on Labs, but:
- We need a way to poll the database for things like all the revision_ids of a given page. We could use the API instead, but it's less efficient.
- We need a way to read the text of revisions. Again, the API can work (a sketch follows below), but having better access is better.
- We need a place to store the authorship information. This is several terabytes for enwiki. Basically, we need access to some text store. Is this available on Labs?

We would welcome more information on how much of the above is feasible on Labs.
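For what it's worth, here is a minimal sketch of the API fallback for the first two items, using the public Action API (continuation handling is elided; direct database and text-store access would of course be more efficient, which is the point above):

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_revisions(title, limit=10):
        """Return (rev_id, user, text) for up to `limit` of a page's oldest revisions."""
        params = {
            "action": "query", "format": "json",
            "prop": "revisions", "titles": title,
            "rvprop": "ids|user|content",
            "rvlimit": limit, "rvdir": "newer",
        }
        data = requests.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        return [(r["revid"], r["user"], r["*"]) for r in page["revisions"]]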
Luca
It sounds like some of those things should be working in Labs soon with DB replication. I doubt they'll let you store terabytes, though.
Alex Monk
Hi Luca,
We are working on somewhat related issues in Parsoid [1][2]. The modified HTML DOM is diffed against the original DOM on the way in, and each modified node is annotated with the base revision. We don't store this information yet; right now we use it to selectively serialize modified parts of the page back to wikitext. We will, however, soon store the HTML along with the wikitext for each revision, which should make it possible to display a coarse blame map.
There are several limitations:
* We don't preserve blame information on wikitext edits yet. This should become possible with the incremental re-parsing optimization, which is on our roadmap for this summer.
* Our DOM diff algorithm is extremely simplistic. We are considering porting XyDiff for better move detection.
* The information is pretty coarse, at a node level. Refining this to a word level would require an efficient encoding for that information, possibly as length/revision pairs associated with the wrapping element (a possible encoding is sketched after this list).
* We have not yet moved metadata from attributes to a metadata section with an efficient encoding.
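On the length/revision-pair idea in the third point, run-length compressing per-word origin revisions could stay quite compact. A hypothetical sketch of such an encoding:

    def encode_runs(word_revs):
        """Per-word origin revisions -> run-length (length, rev) pairs."""
        runs = []
        for rev in word_revs:
            if runs and runs[-1][1] == rev:
                runs[-1][0] += 1                 # extend the current run
            else:
                runs.append([1, rev])            # start a new run
        return [tuple(r) for r in runs]

    print(encode_runs([100, 100, 100, 205, 205, 100]))
    # [(3, 100), (2, 205), (1, 100)]  (could be stored as JSON on the wrapping element)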
We don't currently plan to work on blame maps ourselves. Maybe there are opportunities for collaboration?
Gabriel
[1]: http://www.mediawiki.org/wiki/Parsoid
[2]: http://www.mediawiki.org/wiki/Parsoid/Roadmap
Hi, as Luca already mentioned, we (my colleagues Maribel Acosta and Felix Keppmann and I) are also working on an algorithm for authorship detection. Our approach is somewhat different from Luca and Michael's in that we rebuild authorship information for words in paragraphs and sentences via MD5 hashes (i.e., we check whether they have existed before at any point in the article's history) and use a diff algorithm to detect the changes in the parts of the article that haven't been seen before.
We build on an older, more basic model of ours, described in the paper Luca already included in his mail [1]. Currently we are at 0.04 seconds per revision for the pure calculation, without writing/reading the hashes to/from a database; making the method incremental is the step we are working on now. We will make the code publicly available soon. We would like to contribute as much as we can to the Wikipedia authorship project with our solution and are open to any collaboration.
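To give a flavor of the hashing idea, here is a toy sketch (only illustrative: the real system works at paragraph, sentence, and word granularity, and unseen parts go through a word-level diff rather than being attributed wholesale):

    import hashlib

    seen = {}  # md5(sentence) -> (rev_id, author) of its first occurrence, ever

    def attribute(rev_id, author, sentences):
        out = []
        for s in sentences:
            h = hashlib.md5(s.encode("utf-8")).hexdigest()
            if h not in seen:
                # unseen sentence: the real algorithm diffs it at word level here
                seen[h] = (rev_id, author)
            out.append((s,) + seen[h])
        return out

    attribute(1, "alice", ["Paris is in France.", "It is big."])
    print(attribute(2, "bob", ["Paris is in France.", "It is very big."]))
    # the unchanged sentence keeps alice; the modified one is credited to bob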
Another issue is, of course, the accuracy of the attributed words, and we will ask the community for input to evaluate it. We have set up a small gold standard set of 184 words and their origins (who wrote them in which revision), which can be found here: [2]. The words were randomly selected and their origins determined manually. I invite everyone to look at this set, comment on whether the postulated revisions of origin seem right, and perhaps extend it. Although we will run an evaluation with a bigger user base, this serves as a useful starting point for preliminary testing. Right now we reach an accuracy of ~85% on this set (compared to ~50% for the old WikiTrust algorithm, see [1]), although there are still a lot of tuning possibilities in our algorithm.
Best,
Fabian
[1] http://wikipedia-academy.de/2012/w/images/2/24/23_Paper_Fabian_Fl%C3%B6ck_An...
[2] https://docs.google.com/spreadsheet/ccc?key=0An7RIRiLIXD5dENITFpmU0c1RVZaU1N...
-- Karlsruhe Institute of Technology (KIT) Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck Research Associate
Building 11.40, Room 222 KIT-Campus South D-76128 Karlsruhe
Phone: +49 721 608 4 6584 Fax: +49 721 608 4 6580 Skype: f.floeck_work E-Mail: fabian.floeck@kit.edu WWW: http://www.aifb.kit.edu/web/Fabian_Fl%C3%B6ck
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
On 02/26/2013 02:29 AM, Luca de Alfaro wrote:
- We need a way to poll the database for things like all the revision_ids of a given page. We could use the API instead, but it's less efficient.
Yes, as others have said, Labs should allow that either now or shortly. You should sign up for https://lists.wikimedia.org/mailman/listinfo/labs-l and feel free to ask Labs questions there.
- We need a place to store the authorship information. This is several terabytes for enwiki. Basically, we need access to some text store. Is this available on Labs?
I don't know if you'll be able to get that or not. You'll have to make a special request.
Matt Flaschen
On 02/25/2013 06:21 PM, Luca de Alfaro wrote:
I am writing this message as we hope this might be of interest, and as we would be quite happy to find people willing to collaborate. Is anybody interested in developing a GUI for it and talking to us about what API we should provide for retrieving this authorship information? Is there anybody interested in helping move the code to a production-ready stage?
I'm emphasizing this message. Thanks for the roundup, Luca!
On 02/25/2013 09:21 PM, Luca de Alfaro wrote:
The problem is putting together a bit of effort to get to that first running version.
How big are the wikis that you've tried this on? Would smaller academic wikis be able to use this code?
I may have a use for your code, since one of the wikis I'm working on is targeted at academics, for whom getting a citation really improves wiki participation.
I have briefly toyed with something similar. Unlike yours, it has a (very simple and rudimentary) interface, but no sophisticated algorithms inside :) – just a standard LCS diff library. It also works in real time (but is awfully slow).
It can be seen at http://wikiblame.heroku.com/ (source at https://github.com/MatmaRex/wikiblame). There's some weird bug right now that makes it fail for titles with non-ASCII characters, which I haven't had time to investigate, and due to free-platform limitations it'll fail if generating the blame map takes over 30 seconds (that would be most articles over 3 kB or with more than 50 revisions). I was intending to move it to some toolserver or Labs or something, but haven't had time for that either.
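The core of such an LCS-based blame is small. A simplified sketch (in Python with difflib, not my actual code): walk the history oldest to newest and label each word with the revision that introduced it.

    import difflib

    def blame(revisions):
        """revisions: list of (rev_id, word_list), oldest first.
        Returns (word, origin_rev) pairs for the newest revision."""
        labels, prev_words = [], []
        for rev_id, words in revisions:
            sm = difflib.SequenceMatcher(a=prev_words, b=words, autojunk=False)
            new_labels = []
            for op, i1, i2, j1, j2 in sm.get_opcodes():
                if op == "equal":
                    new_labels.extend(labels[i1:i2])         # kept words keep their origin
                else:
                    new_labels.extend([rev_id] * (j2 - j1))  # inserted/replaced words are new
            labels, prev_words = new_labels, words
        return list(zip(prev_words, labels))

    revs = [(1, "the cat sat".split()), (2, "the big cat sat down".split())]
    print(blame(revs))
    # [('the', 1), ('big', 2), ('cat', 1), ('sat', 1), ('down', 2)]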
I've also seen some gadget on en.wiki that did something similar, but I don't remember the name and can't find it right now.

-- Matma Rex
Your site doesn't work: http://blamemaps.wmflabs.org/mw/index.php/Main_Page -> the connection timed out.