Hi all,
I've been directed here by Brion, Robchurch and others on #wikimedia-tech. I'd like to propose a new feature for Wikipedia, which people on #wikimedia-tech mostly refer to as a blame page or blame map. I would prefer to call it something like "Track contributions mode" (because of the similarity with MS Word's track changes mode) or "Hall of fame", but whatever. I have a live prototype written in PHP & MySQL at http://217.147.83.36:9000/ An example of a "blame map" can be seen at http://217.147.83.36:9000/history::171 and two blame maps compared at http://217.147.83.36:9000/history::171=169
For some reason folks on #wikimedia-tech were mainly concerned with speed and almost nothing else, so I'll try to explain the performance issues as best I can.
First of all, I DO NOT propose to recalculate diffs for all the zillions of edits Wikipedia already has. Diffs would only be calculated for new edits. Next, I want to explain in detail how I see this working. First, I propose to modify the revision table and add a flag with the following possible values: "Revision is too old to be diffed", "Revision is awaiting to be diffed", "Revision has been diffed". Another table should also be added to store the blame map for each revision. The blame map for each subsequent revision will be calculated incrementally from the previous one, so it doesn't really matter whether an article has 10 or 1000 revisions. We only ever need the last blame map.
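To make the incremental step concrete, here is a minimal sketch in Python (not the prototype's PHP). It assumes a blame map is simply a list of (word, author) pairs and uses the standard library's difflib for the word-level diff. The function name update_blame and the whitespace word splitting are illustrative assumptions, not what the prototype actually does.

```python
from difflib import SequenceMatcher


def update_blame(prev_blame, new_text, author):
    """Build the blame map for a new revision from the previous one.

    prev_blame: list of (word, author) pairs for the previous revision.
    new_text:   full text of the new revision.
    author:     who saved the new revision.
    """
    old_words = [word for word, _ in prev_blame]
    new_words = new_text.split()  # assumption: words are whitespace-separated

    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    blame = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged words keep whoever they were attributed to before.
            blame.extend(prev_blame[i1:i2])
        elif tag in ("replace", "insert"):
            # New or rewritten words are attributed to the current editor.
            blame.extend((word, author) for word in new_words[j1:j2])
        # "delete": the removed words simply drop out of the map.
    return blame


rev1 = update_blame([], "the quick fox", "alice")
rev2 = update_blame(rev1, "the quick brown fox", "bob")
```

Only the previous revision's map is read, so the cost of one update depends on the size of the article and not on how many revisions it has.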
I also propose to have separate dedicated diff server(s) whose sole job is to calculate diffs in the background. I.e. the diff server grabs a revision with the "Revision is awaiting to be diffed" flag and the last blame map from the database, calculates the diff, stores the new blame map in the database and finally changes the revision flag to "Revision has been diffed". Repeat.
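The loop above could look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for MySQL, and every table, column and function name in it (revision, blame_map, diff_state, process_one) is a hypothetical choice for illustration, not the real MediaWiki schema. The diff itself is passed in as a callable so the sketch stays independent of any particular diff algorithm.

```python
import json
import sqlite3

# Hypothetical numeric encoding of the three flag values.
TOO_OLD, PENDING, DONE = 0, 1, 2

# Hypothetical schema: one flag per revision, one stored blame map per revision.
SCHEMA = """
CREATE TABLE revision (
    rev_id INTEGER PRIMARY KEY,
    page_id INTEGER,
    author TEXT,
    body TEXT,
    diff_state INTEGER
);
CREATE TABLE blame_map (
    rev_id INTEGER PRIMARY KEY,
    page_id INTEGER,
    map_json TEXT
);
"""


def process_one(conn, compute_blame):
    """Diff the oldest pending revision. Returns False if nothing is pending."""
    row = conn.execute(
        "SELECT rev_id, page_id, author, body FROM revision "
        "WHERE diff_state = ? ORDER BY rev_id LIMIT 1",
        (PENDING,),
    ).fetchone()
    if row is None:
        return False
    rev_id, page_id, author, body = row

    # Only the most recent blame map of the same page is needed.
    prev = conn.execute(
        "SELECT map_json FROM blame_map WHERE page_id = ? AND rev_id < ? "
        "ORDER BY rev_id DESC LIMIT 1",
        (page_id, rev_id),
    ).fetchone()
    prev_blame = json.loads(prev[0]) if prev else []

    new_blame = compute_blame(prev_blame, body, author)

    # Store the map and flip the flag in one transaction, so a crash
    # leaves the revision in the pending state rather than half-done.
    with conn:
        conn.execute(
            "INSERT INTO blame_map (rev_id, page_id, map_json) VALUES (?, ?, ?)",
            (rev_id, page_id, json.dumps(new_blame)),
        )
        conn.execute(
            "UPDATE revision SET diff_state = ? WHERE rev_id = ?",
            (DONE, rev_id),
        )
    return True
```

A real worker would call process_one in a loop and sleep when it returns False.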
In addition, the article display logic should be altered. The module that displays an article should check the diff flag. If the flag is set to "Revision is too old to be diffed", no further changes are needed. If the flag is set to "Revision is awaiting to be diffed", a Credits section should be created that contains only the message "Calculation in progress". If the flag is set to "Revision has been diffed", a Credits section should be created that contains the list of contributors ordered by contribution size. The list of contributors in the correct order can be generated with a single select on the blame map table, and this select can be cached. A direct link to the blame map should be displayed too. If the user clicks on this link, the corresponding blame map should be presented. Every blame map can be generated with a single select and can be placed in the cache. Yawn
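The three display cases can be sketched as follows. This is a Python illustration only: the numeric flag values, the function name credits_section and measuring "contribution size" as a count of surviving words are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical numeric encoding of the three flag values.
TOO_OLD, PENDING, DONE = 0, 1, 2


def credits_section(diff_state, blame=None):
    """Return the text of the Credits section, or None if there is none.

    blame: list of (word, author) pairs, only needed when diff_state is DONE.
    """
    if diff_state == TOO_OLD:
        return None  # article is displayed exactly as before
    if diff_state == PENDING:
        return "Calculation in progress"

    # Count surviving words per author and list the largest contributors first.
    # Ties are broken alphabetically so the output is stable.
    counts = Counter(author for _word, author in blame)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ", ".join(f"{author}: {count}" for author, count in ranked)


blame = [("the", "alice"), ("quick", "alice"), ("brown", "bob"), ("fox", "alice")]
print(credits_section(DONE, blame))  # prints "alice: 3, bob: 1"
```

In a database-backed version the counting and ordering would be done by the single select mentioned above, with the result cached.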
If you are still awake by now, here are some more thoughts on fault tolerance. Should the diff server die, crash, fail or whatever, the only side effect the end user will see is a "Calculation in progress" message right after the article body. That's it. No slowdown or anything. If the user still wants to see some kind of diff, he/she can still use the old diff engine. Because blame maps aren't calculated in real time, this feature is an impractical target for DoS attacks. However, I should point out that any real-time diff algorithm is one big fat target for DoS attacks on other wikis which run on a single server without some sort of acceleration.
There is also a small Unicode issue. Due to crappy UTF-8 support in PHP, all non-Latin characters are currently ignored. I believe this could be solved either by enabling proper Unicode support in PHP or by writing custom code to separate words. But before that I propose to test on English Wikipedia first, because if it works for English it should work for other languages.
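For comparison, here is what Unicode-aware word separation looks like in a language where the regex engine handles it natively. This is a Python sketch, not a fix for the PHP prototype, and the function name split_words is made up for the example.

```python
import re
import unicodedata


def split_words(text):
    """Split text into words, keeping non-Latin letters.

    In Python 3, \\w in a str pattern matches Unicode letters and digits,
    so Cyrillic or accented words are not dropped. The text is normalized
    to NFC first so that a letter followed by a combining accent is treated
    as a single precomposed character where one exists.
    """
    normalized = unicodedata.normalize("NFC", text)
    return re.findall(r"\w+", normalized)


print(split_words("Привет, мир! naïve café"))
```

Note that this only works for scripts that put spaces or punctuation between words. A run of Chinese or Japanese text would come back as one long "word", so those languages would need a separate segmentation step.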
So I offer the following practical steps. Dedicate one of the servers to be a diff playground. I will need a shell account on this server. Install MediaWiki on it, with the diff logic running in the background. Create a read-only MySQL account on the live database server. As a result, this diff server can grab new revisions from the live database, diff them and store the results locally. This way we can find out how many edits a single server can process and see how many servers this feature will require in total (I don't think it will be more than 2-3 though).
In conclusion, I'd like to say that in my opinion this feature will be useful and practical if implemented. It can also be a crucial building block for other interesting features. However, I want to stress that I'm not interested in doing this *unless* it is used on English Wikipedia and I'm given appropriate credit. I can explain why I want that in a private e-mail.
Thank you for reading this long and boring e-mail.