I agree; in fact, we don't do it in the write pipeline. The code we wrote implements a simple queue in which page_ids are queued for processing. The processing job then takes a page_id out of that table and processes all the missing revisions for that page. This is also useful if, say, there is a page merge or something similar: we can just erase all authorship information for that page, and it will be rebuilt at the next edit.
What we wrote can also work on Labs, but:
- We need a way to query the database for things like the list of all revision_ids of a given page. We could use the API instead, but it is less efficient.
- We need a way to read the text of revisions. Again, the API can work, but more direct access would be better.
- We need a place to store the authorship information. This is several terabytes for enwiki. Basically, we need access to some text store. Is this available on Labs?
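For reference, the API fallback for the first point could look roughly like the sketch below. It only builds the request URL and parses a response, so it does no network I/O. The helper names are hypothetical; the query parameters are those of the MediaWiki action API (prop=revisions), and the response layout and the name of the continuation key are assumptions that may differ between MediaWiki versions.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def revision_ids_url(page_id, rvcontinue=None):
    # Build a query asking for the revision ids of one page.
    params = {
        "action": "query",
        "prop": "revisions",
        "pageids": page_id,
        "rvprop": "ids",
        "rvlimit": "max",
        "format": "json",
    }
    if rvcontinue is not None:
        # Token from the previous response, to fetch the next batch.
        params["rvcontinue"] = rvcontinue
    return API_URL + "?" + urlencode(params)


def parse_revision_ids(response, page_id):
    # Assumed layout: pages keyed by page id (as a string), each with a
    # list of revisions carrying a "revid" field.
    page = response["query"]["pages"][str(page_id)]
    rev_ids = [rev["revid"] for rev in page.get("revisions", [])]
    # Assumed continuation key; None means there are no more batches.
    next_token = response.get("continue", {}).get("rvcontinue")
    return rev_ids, next_token


# Hand-written illustrative response, not real data.
sample = {
    "continue": {"rvcontinue": "token-for-next-batch"},
    "query": {"pages": {"12345": {"pageid": 12345, "revisions": [
        {"revid": 30, "parentid": 20},
        {"revid": 20, "parentid": 10},
    ]}}},
}
rev_ids, next_token = parse_revision_ids(sample, 12345)
```

Each batch costs one HTTP round trip, which is why direct database access would be more efficient for pages with long histories.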
We would welcome more information on how much of the above is feasible on labs.
Luca
On Mon, Feb 25, 2013 at 7:27 PM, Matthew Flaschen mflaschen@wikimedia.org wrote:
On 02/25/2013 09:21 PM, Luca de Alfaro wrote:
I am writing this message as we hope this might be of interest, and as we would be quite happy to find people willing to collaborate. Is anybody interested in developing a GUI for it and talking to us about what API we should have for retrieving this authorship information? Is there anybody interested in helping to move the code to a production-ready stage?
Are you planning to run this live in production (i.e. 1-2 seconds on every save)?
I think people would be reluctant to slow writes down further. You could potentially do it deferred, or in the job queue, but I think it might make more sense on something like Wikimedia Labs (https://www.mediawiki.org/wiki/Wikimedia_Labs)
Did you try doing it with no caching (similar to git blame, though I know it's a different algorithm)? I'm wondering how much benefit you get from the cached info.
Matt Flaschen
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l