Hi all,
I've been directed here by Brion, Robchurch and others on #wikimedia-tech. So I propose a new feature for Wikipedia which people on #wikimedia-tech mostly refer to as a blame page or blame map. I would prefer to call it something like "Track contributions mode" (because of its similarity to MS Word's track changes mode) or "Hall of fame", but whatever. I have a live prototype written in PHP & MySQL at http://217.147.83.36:9000/ - an example of a "blame map" can be seen at http://217.147.83.36:9000/history::171 and two blame maps compared at http://217.147.83.36:9000/history::171=169
For some reason, folks on #wikimedia-tech were mainly concerned with speed and almost nothing else, so I'll try to explain the performance issues as best I can.
First of all, I DO NOT propose to recalculate diffs for all the zillions of edits Wikipedia already has. Diffs would only be calculated for new edits. Next, I want to explain in detail how I see this working. First, I propose to modify the revision table and add a flag with the following possible values: "Revision is too old to be diffed", "Revision is waiting to be diffed", "Revision has been diffed". Another table should also be added to store the blame map for each revision. The blame map for each subsequent revision will be calculated incrementally, so it doesn't really matter whether an article has 10 or 1,000 revisions; we only ever need the last blame map.
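For concreteness, the data model being described might look something like the following sketch. Everything here (the rev_diff_status column, the blamemap table and the numeric status values) is a hypothetical illustration, not taken from the prototype:

<?php
// Hypothetical values for a new revision.rev_diff_status column.
define( 'DIFF_TOO_OLD', 0 ); // revision predates the feature; never diffed
define( 'DIFF_PENDING', 1 ); // waiting to be diffed by a background server
define( 'DIFF_DONE',    2 ); // blame map has been calculated

// Hypothetical blamemap table, one row per diffed revision:
//   bm_rev_id   INT UNSIGNED  -- revision this map describes
//   bm_page_id  INT UNSIGNED  -- page, so the latest map per page is easy to find
//   bm_map      MEDIUMBLOB    -- serialized map: word position => revision that
//                                introduced that word
?>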
I also propose to have one or more separate, dedicated diff servers whose sole job is to calculate diffs in the background. I.e. a diff server grabs a revision flagged "Revision is waiting to be diffed" and the last blame map from the database, calculates the diff, stores the new blame map in the database and changes the revision flag to "Revision has been diffed". Repeat.
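A background worker along these lines might look roughly like the sketch below. It uses the hypothetical column and table from the previous sketch, plain PDO for database access, and a placeholder calculateBlameMap() standing in for the prototype's actual diff logic:

<?php
// Hypothetical background diff worker, meant to run on a dedicated server.
$db = new PDO( 'mysql:host=db.example;dbname=wikidb', 'diffworker', 'secret' );

while ( true ) {
    // Grab the oldest revision that is still waiting to be diffed.
    $rev = $db->query(
        "SELECT rev_id, rev_page FROM revision
         WHERE rev_diff_status = 1 ORDER BY rev_id LIMIT 1"
    )->fetch( PDO::FETCH_ASSOC );
    if ( !$rev ) {
        sleep( 5 ); // nothing pending; poll again later
        continue;
    }

    // Load the most recent blame map for this page (false for a new page).
    $stmt = $db->prepare(
        "SELECT bm_map FROM blamemap WHERE bm_page_id = ?
         ORDER BY bm_rev_id DESC LIMIT 1"
    );
    $stmt->execute( array( $rev['rev_page'] ) );
    $previousMap = $stmt->fetchColumn();

    // Placeholder: incrementally fold the new revision into the old map.
    $newMap = calculateBlameMap( $previousMap, $rev['rev_id'] );

    // Store the new map and mark the revision as diffed.
    $db->prepare( "INSERT INTO blamemap (bm_rev_id, bm_page_id, bm_map)
                   VALUES (?, ?, ?)" )
       ->execute( array( $rev['rev_id'], $rev['rev_page'], $newMap ) );
    $db->prepare( "UPDATE revision SET rev_diff_status = 2 WHERE rev_id = ?" )
       ->execute( array( $rev['rev_id'] ) );
}
?>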
In addition, the article display logic should be altered. The module that displays an article should check the diff flag. If the flag is set to "Revision is too old to be diffed", no further changes are needed. If it is set to "Revision is waiting to be diffed", then a Credits section should be created that contains only the message "Calculation in progress". If it is set to "Revision has been diffed", then a Credits section should be created that contains the list of contributors ordered by contribution size. The list of contributors in the correct order can be generated with a single select against the blame map table, and this select can be cached. A direct link to the blame map should be displayed too; if the user clicks on this link, the corresponding blame map is presented. Every blame map can be generated with a single select and can be placed in a cache. Yawn.
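The display-side check is then a small function; the sketch below assumes the same hypothetical status values, a memcached-style cache object, and a placeholder renderCredits() that turns a stored blame map into the contributor list:

<?php
// Hypothetical hook called while rendering an article page.
function showCreditsSection( PDO $db, Memcache $cache, $pageId, $revId, $status ) {
    if ( $status == 0 ) {                 // too old to be diffed
        return '';                        // no Credits section at all
    }
    if ( $status == 1 ) {                 // still waiting for a diff server
        return "<div class='credits'>Calculation in progress</div>";
    }
    // Diffed: show the contributor list, cached so the select runs rarely.
    $key = "credits:$pageId:$revId";
    $html = $cache->get( $key );
    if ( $html === false ) {
        $stmt = $db->prepare( "SELECT bm_map FROM blamemap WHERE bm_rev_id = ?" );
        $stmt->execute( array( $revId ) );
        $html = renderCredits( $stmt->fetchColumn() ); // placeholder renderer
        $cache->set( $key, $html, 0, 3600 );           // cache for an hour
    }
    return $html;
}
?>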
If you are still awake by now, here are more thoughts on fault tolerance. Should a diff server die, crash, fail or whatever, the only side effect the end user will see is a "Calculation in progress" message right after the article body. That's it. No slowdown or anything. If the user still wants to see some kind of diff, he or she can still use the old diff engine. Because blame maps aren't calculated in real time, this feature is an impractical target for DoS attacks. However, I should point out that any real-time diff algorithm is one big fat target for DoS attacks on other wikis which run on a single server without some sort of acceleration.
There is also a small Unicode issue. Due to crappy UTF-8 support in PHP, all non-Latin characters are currently ignored. I believe this could be solved either by enabling proper Unicode support in PHP or by writing custom code to separate words. But before that, I propose to test on the English Wikipedia first, because if it works for English it should work for other languages.
So I offer the following practical steps. Dedicate one of the servers to be a diff playground; I will need a shell account on this server. Install MediaWiki on it along with the diff logic running in the background. Create a read-only MySQL account on the live database server. As a result, this diff server can grab new revisions from the live database, diff them and store the results locally. This way we can find out how many edits a single server can process and see how many servers this feature will require in total (I don't think it will be more than 2-3, though).
In conclusion, I'd like to say that in my opinion this feature will be useful and practical if implemented. It can also be a crucial building block for other interesting features. However, I want to stress that I'm not interested in doing this *unless* it is used on the English Wikipedia and I'm given appropriate credit. I can give the reason why I want that in a private e-mail.
Thank you for reading this long and boring e-mail.
I'm forwarding this to MediaWiki-l also. I think it's more relevant there anyway.
I like that a lot, and think it should be implemented in MediaWiki ASAP. I'm e-mailing him to ask why he only wants to contribute if it's used in 'teh wikipedia'<!-- wikicrapia more like it -->, there are lots of other wiki sites out there that need this and could definitely use it. I'm aallll for this, and willing to help in any way I can.
Hoi, This small Unicode issue is a show stopper. When software is suggested that only works on Latin script, you do not appreciate the amount of work that is done in other scripts using the MediaWiki software.
Apart from that why would it be boring.. this is a technical list. Personally I am interested in two things as well, what other projects are you referring to and how you want to see this attribution done.
Thanks, GerardM
Elliott F. Cable wrote:
I'm forwarding this to MediaWiki-l also. I think it's more relevant there anyway.
I like that a lot, and think it should be implemented in MediaWiki ASAP. I'm e-mailing him to ask why he only wants to contribute if it's used in 'teh wikipedia'<!-- wikicrapia more like it -->, there are lots of other wiki sites out there that need this and could definitely use it. I'm aallll for this, and willing to help in any way I can.
Gerard Meijssen wrote:
Hoi, This small Unicode issue is a show stopper. When software is suggested that only works on Latin script, you do not appreciate the amount of work that is done in other scripts using the MediaWiki software.
Apart from that why would it be boring.. this is a technical list. Personally I am interested in two things as well, what other projects are you referring to and how you want to see this attribution done.
I discussed unicode support with the original poster on IRC. I couldn't get through to him that adding UTF-8 support to a PHP application is trivial, and requires no special UTF-8 support within PHP itself. MediaWiki's UTF-8 support is mostly implemented from scratch using PHP's binary-safe string handling. My wikidiff2 module in C++ also contains a simple UTF-8 decoder within the word splitting routine. It's not difficult.
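To make "simple UTF-8 decoder" concrete, here is a minimal sketch of the same idea in PHP; it is an illustration only, not the actual MediaWiki or wikidiff2 code, and it needs nothing from PHP beyond ordinary byte-indexed strings:

<?php
// Decode one UTF-8 sequence starting at byte offset $i of $s, returning the
// code point and advancing $i past the sequence.
// Sketch only: assumes well-formed input and does no error checking.
function utf8CodePointAt( $s, &$i ) {
    $b = ord( $s[$i++] );
    if ( $b < 0x80 ) {                    // 0xxxxxxx: plain ASCII
        return $b;
    } elseif ( $b < 0xE0 ) {              // 110xxxxx: two-byte sequence
        $cp = $b & 0x1F; $extra = 1;
    } elseif ( $b < 0xF0 ) {              // 1110xxxx: three-byte sequence
        $cp = $b & 0x0F; $extra = 2;
    } else {                              // 11110xxx: four-byte sequence
        $cp = $b & 0x07; $extra = 3;
    }
    while ( $extra-- > 0 ) {              // 10xxxxxx continuation bytes
        $cp = ( $cp << 6 ) | ( ord( $s[$i++] ) & 0x3F );
    }
    return $cp;
}
?>

A word splitter can loop over a string with this and classify each code point, for example against the code point ranges listed later in this thread.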
-- Tim Starling
On 08/06/06, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Gerard Meijssen wrote:
Hoi, This small Unicode issue is a show stopper. When software is suggested that only works on Latin script, you do not appreciate the amount of work that is done in other scripts using the MediaWiki software.
"You do not appreciate" - rather a confrontational tone, there. Who are we to assume that someone else doesn't appreciate the amount of effort put in elsewhere? It might be correct, but then again, there might be no specific bias against it.
Apart from that why would it be boring.. this is a technical list. Personally I am interested in two things as well, what other projects are you referring to and how you want to see this attribution done.
Apart from why what would be boring? The post was to get feedback; don't withhold it. I would imagine standard attribution for the code under the GNU GPL, blah blah blah. We won't be adding flashing banners, "Wikipedia now uses a feature from XYZ". Or are we to start crediting developers with individual features? "Thanks for clearing your watchlist, c/o Rob Church."
I discussed unicode support with the original poster on IRC. I couldn't get through to him that adding UTF-8 support to a PHP application is trivial,
My impression of the poster was that he didn't completely understand the whole UTF-8/Unicode/blah thing or its implications, and looked somewhat confused.
and requires no special UTF-8 support within PHP itself. MediaWiki's UTF-8 support is mostly implemented from scratch using PHP's binary-safe string handling. My wikidiff2 module in C++ also contains a simple UTF-8 decoder within the word splitting routine. It's not difficult.
If the *idea* is found to be viable, adding the UTF-8 goodies will be trivial, and we'll put the damn effort in.
Rob Church
Regarding UTF-8 support. Perhaps it would be better if I try to explain some of the problems I'm facing. For example, I'm not tracking the most frequently used English words (a, the, and, or…). In my opinion, every language should be tweaked separately, and that's why I'm suggesting to first test it on the English Wikipedia. Also, I don't have a problem with finding spaces in UTF-8 encoded strings and splitting there. The problem is that some Unicode characters, like ẅ (letter w with two dots on top, Unicode code 0x1E85), are used to write words, and some Unicode characters, such as ' (left single quotation mark, Unicode code 0x2018), are used to separate words. I also believe these characters could be encoded as HTML entities in wikitext. As I'm tracking words, I need to distinguish between these "character classes", as they are known in regular expressions (i.e. \w word character and \W non-word character). If Tim Starling has a silver bullet that can solve these problems, feel free to e-mail it to me. However, in my opinion, implementing that kind of UTF-8 support from scratch can be a somewhat tricky business. The bottom line is that the problems above *can* be solved, but what I suggest is to try the English Wikipedia first to see how it's going to work in general and whether it's a useful feature. Support for other languages could and should be added later, one language at a time.
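For what it's worth, PCRE as exposed through PHP's preg functions can already make exactly this distinction when it is built with UTF-8 and Unicode property support; whether a given server's PCRE has that support is an assumption, not a given, and the splitWords() helper below is only an illustration:

<?php
// Split UTF-8 text into words on anything that is not a letter, digit or
// underscore, using Unicode character properties (requires PCRE built with
// UTF-8 and Unicode property support).
function splitWords( $text ) {
    return preg_split( '/[^\p{L}\p{N}_]+/u', $text, -1, PREG_SPLIT_NO_EMPTY );
}

// U+2018 (left single quotation mark) has property \p{P}, so it splits;
// U+1E85 (w with diaeresis) has property \p{L}, so it stays inside the word.
var_dump( splitWords( "One\xE2\x80\x98two" ) );  // array( 'One', 'two' )
var_dump( splitWords( "One\xE1\xBA\x85two" ) );  // array( 'Oneẅtwo' )
?>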
Roman Nosov wrote:
Regarding UTF-8 support. Perhaps it would be better if I try to explain some of the problems I'm facing. For example, I'm not tracking the most frequently used English words (a, the, and, or…). In my opinion, every language should be tweaked separately, and that's why I'm suggesting to first test it on the English Wikipedia. Also, I don't have a problem with finding spaces in UTF-8 encoded strings and splitting there. The problem is that some Unicode characters, like ẅ (letter w with two dots on top, Unicode code 0x1E85), are used to write words, and some Unicode characters, such as ' (left single quotation mark, Unicode code 0x2018), are used to separate words. I also believe these characters could be encoded as HTML entities in wikitext. As I'm tracking words, I need to distinguish between these "character classes", as they are known in regular expressions (i.e. \w word character and \W non-word character). If Tim Starling has a silver bullet that can solve these problems, feel free to e-mail it to me. However, in my opinion, implementing that kind of UTF-8 support from scratch can be a somewhat tricky business. The bottom line is that the problems above *can* be solved, but what I suggest is to try the English Wikipedia first to see how it's going to work in general and whether it's a useful feature. Support for other languages could and should be added later, one language at a time.
High-numbered punctuation characters are rare, so the approach I took in wikidiff2 was to consider them part of the word. I considered all non-alphanumeric characters less than 0xc0 as word-splitting punctuation characters. There are three languages that I'm aware of that don't use spaces to separate words, and thus require special handling: Chinese, Japanese and Thai. They are the only ones that I was able to find while searching the web for word segmentation information, and nobody from any other language wiki has complained. Chinese and Japanese are adequately handled by doing character-level diffs -- I received lots of praise from the Japanese Wikipedia for this scheme. Chinese and Japanese word segmentation for search or machine translation is a much more difficult problem, but luckily solving it is unnecessary for diff formatting. Character-level diffs may well be superior anyway.
For Thai I am using character-level diffs, and although I haven't received any complaints from the Wikipedians, I believe this is less than ideal. Thai has lots of composing characters, so you often end up highlighting little dots on top of letters and the like. Really what is required here is dictionary-based word segmentation. Our search engine is also next to useless on the Thai Wikipedia due to the lack of word segmentation. But that's not a problem Roman has to solve.
Putting all that together, here's how I detect word characters in wikidiff2:
inline bool my_istext(int ch) {
    // Standard alphanumeric
    if ((ch >= '0' && ch <= '9') || (ch == '_') ||
        (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
    {
        return true;
    }
    // Punctuation and control characters
    if (ch < 0xc0) return false;
    // Thai, return false so it gets split up
    if (ch >= 0xe00 && ch <= 0xee7) return false;
    // Chinese/Japanese, same
    if (ch >= 0x3000 && ch <= 0x9fff) return false;
    if (ch >= 0x20000 && ch <= 0x2a000) return false;
    // Otherwise assume it's from a language that uses spaces
    return true;
}
Now this might not sound "trivial" anymore. UTF-8 support is trivial, I'll stand by that, but supporting all the languages of the world is not so trivial. But as you can see, language support isn't as hard as you might think, because lots of research has already been done.
-- Tim Starling
On 6/8/06, Tim Starling t.starling@physics.unimelb.edu.au wrote:
High-numbered punctuation characters are rare, so the approach I took in wikidiff2 was to consider them part of the word. I considered all non-alphanumeric characters less than 0xc0 as word-splitting punctuation characters.
The Unicode character databases actually include information on which chars are letters, which are punctuation, etc. Some programming languages incorporate this into appropriate functions such as isletter(), ispunct() or the like. I believe Perl has them. I don't know whether PHP has them or not, but if it doesn't, that might be considered a bug.
There are three languages that I'm aware of that don't use spaces to separate words, and thus require special handling: Chinese, Japanese and Thai. They are the only ones that I was able to find while searching the web for word segmentation information, and nobody from any other language wiki has complained.
The other language I can think of that doesn't use spaces is Khmer, but it doesn't have many fonts yet, and so there are very few web sites, if any, and surely no wikis. Some other Southeast Asian scripts may fall into the same category.
Chinese and Japanese are adequately handled by doing character-level diffs -- I received lots of praise from the Japanese Wikipedia for this scheme. Chinese and Japanese word segmentation for search or machine translation is a much more difficult problem, but luckily solving it is unnecessary for diff formatting. Character-level diffs may well be superior anyway.
For Thai I am using character-level diffs, and although I haven't received any complaints from the Wikipedians, I believe this is less than ideal. Thai has lots of composing characters, so you often end up highlighting little dots on top of letters and the like. Really what is required here is dictionary-based word segmentation.
I believe there are free dictionary-based word segmentation algorithms available for Thai. They're known not to be perfect, but I'm not aware of any free Thai word segmenters that do better.
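The usual starting point there is greedy longest matching against a wordlist; a rough sketch follows (the $dictionary argument is a hypothetical flat array of words, and real Thai segmenters handle ambiguity considerably better than this):

<?php
// Greedy longest-match segmentation of UTF-8 $text against a wordlist.
// Sketch only: picks the longest dictionary word at each position and falls
// back to a single character when nothing matches.
function segmentGreedy( $text, array $dictionary, $maxWordChars ) {
    $dict  = array_flip( $dictionary );  // word => index, for O(1) lookup
    $chars = preg_split( '//u', $text, -1, PREG_SPLIT_NO_EMPTY );
    $count = count( $chars );
    $out   = array();
    for ( $i = 0; $i < $count; $i += $len ) {
        // Try the longest candidate first and shrink until a dictionary hit.
        for ( $len = min( $maxWordChars, $count - $i ); $len > 1; $len-- ) {
            if ( isset( $dict[ implode( '', array_slice( $chars, $i, $len ) ) ] ) ) {
                break;
            }
        }
        // $len is 1 here if no dictionary word matched.
        $out[] = implode( '', array_slice( $chars, $i, $len ) );
    }
    return $out;
}
?>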
Andrew Dunbar (hippietrail)
Well, it looks like my question about why some quotation marks break words and others don't will remain unanswered ("rareness" of high-numbered punctuation doesn't make it part of a word)… Anyway, if that level of UTF-8 support is sufficient for MediaWiki then the Unicode issue is "solved". Unicode über alles.
On 6/8/06, Roman Nosov rnosov@gmail.com wrote:
Well, it looks like my question about why some quotation marks break words and others don't will remain unanswered ("rareness" of high-numbered punctuation doesn't make it part of a word)… Anyway, if that level of UTF-8 support is sufficient for MediaWiki then the Unicode issue is "solved". Unicode über alles.
I think it was adequately explained - the reason it isn't detected is that the algorithm doesn't know it's a separation character, so it's not separated. If the algorithm did know, it would be separated properly.
So perhaps someone, like you, should submit a quick patch to that part of the diff engine, as outlined by Tim, that makes it properly interpret that code point. If there's a general rule or table in the Unicode standard then implementing that might be an even better option.
The Unicode site, by the way, is www.unicode.org, and you can find a database of Unicode character properties here:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
with information on interpreting them here:
http://ftp.lanet.lv/ftp/mirror/unicode/3.2-Update/UnicodeData-3.2.0.html
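The third field of each semicolon-separated line in that file is the general category (Lu, Ll, Po, Pi and so on), which is enough to build a table of word-splitting code points. A sketch, assuming a locally downloaded copy of the file and a hypothetical loadSplitters() helper:

<?php
// Build a set of code points that should split words, from UnicodeData.txt.
// Each line is semicolon-separated: field 0 = code point in hex,
// field 2 = general category. Sketch only: the "First>"/"Last>" range
// entries used for large CJK blocks are not expanded here.
function loadSplitters( $path = 'UnicodeData.txt' ) {
    $splitters = array();
    foreach ( file( $path ) as $line ) {
        $fields = explode( ';', $line );
        if ( count( $fields ) < 3 ) {
            continue;
        }
        $cat = $fields[2];
        // P* = punctuation, Z* = separators, C* = control/format/other.
        if ( $cat !== '' && strpos( 'PZC', $cat[0] ) !== false ) {
            $splitters[ hexdec( $fields[0] ) ] = true;
        }
    }
    return $splitters;  // e.g. isset( $splitters[0x2018] ) is true
}
?>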
Enjoy!
Rob Church wrote:
Gerard Meijssen wrote:
Hoi, This small Unicode issue is a show stopper. When software is suggested that only works on Latin script, you do not appreciate the amount of work that is done in other scripts using the MediaWiki software.
"You do not appreciate" - rather a confrontational tone, there. Who are we to assume that someone else doesn't appreciate the amount of effort put in elsewhere? It might be correct, but then again, there might be no specific bias against it.
It is already confrontational of a programmer to pretend the whole world could make do with Latin-1. It is one of the most devastating and accordingly infuriating assumptions that still prevails despite the fact that Unicode is decades old. We're in the 21st century; it is no longer appropriate to even start programming anything where any user-visible text is restricted to Latin-1 or any other 8-bit charset.
Unicode is not a "feature". Unicode is an implementation detail. Latin-1 is a bug.
Timwi
On 08/06/06, Timwi timwi@gmx.net wrote:
It is already confrontational of a programmer to pretend the whole world could make do with Latin-1. It is one of the most devastating and accordingly infuriating assumptions that still prevails despite the fact that Unicode is decades old. We're in the 21st century; it is no longer appropriate to even start programming anything where any user-visible text is restricted to Latin-1 or any other 8-bit charset.
Of course, of course, I clean forgot. Because a quick proof of concept has to be PERFECT, doesn't it. Do excuse that little oversight.
It's not perfect yet. Get over it and give some feedback on the idea.
Rob Church
I totally agree with Timwi – proper Unicode support is a requirement, not a feature. However, can someone tell me why PHP comes with no appropriate out-of-the-box support for such a vital feature in the 21st century? The root cause of my diff engine ignoring Unicode at the moment is that many PHP functions simply don't work with UTF-8 encoded strings. The PHP team promises proper Unicode support only in version 6. Yeah, I guess we are still in the nineties…
However, I think it's much better to honestly say upfront that Unicode isn't properly supported than to claim that it is. For example look no further than Wikipedia's current diff engine. Self-appointed Unicode expert Tim Starling brags that it is extremely easy to build UTF-8 support from scratch. Well, let's check that.
For example, if you use an ordinary single quote (the one from damned Latin-1; you can easily find it on your keyboard) to separate two words in Wikipedia, then there is no problem: the diff engine will see these as two separate words. However, if you use a left single quotation mark (Unicode code 0x2018, the one MS Word likes to use) to separate two words, oops: now these two words are treated as one.
Test case for everyone to check:

Using an ordinary single quote:
  First edit:         One'two
  Second edit:        One'three
  Diff engine output: correctly highlights the words "two" and "three"

Using a left single quotation mark (Unicode code 0x2018; you might need to type it rather than copy & paste it, of course all due to the excellent Unicode support of each and every e-mail program):
  First edit:         One'two
  Second edit:        One'three
  Diff engine output: incorrectly highlights both strings
So my question to all the Unicode Nazis here is: why is the quote from the Latin-1 charset treated *differently* from the slightly different Unicode quote?
Roman Nosov wrote:
So my question to all the Unicode Nazis here is: why is the quote from the Latin-1 charset treated *differently* from the slightly different Unicode quote?
Because they are different characters and the diff engine doesn't recognize the MS Word one? It can be fixed, I believe :o)

My editor (vim) uses the "regular" single quote when I edit Unicode text.
Roman Nosov wrote:
I totally agree with Timwi – proper Unicode support is a requirement, not a feature. However, can someone tell me why PHP comes with no appropriate out-of-the-box support for such a vital feature in the 21st century?
Well, you see, that (the fact that PHP misses out on the most basic vital features, not just Unicode specifically) is kind of why no sensible 21st-century programmer would ever recommend PHP to start writing something new, and why the only people who choose or even recommend PHP are amateurs. Now obviously we are "stuck" with MediaWiki written in PHP, so we have to use it and live with its severe shortcomings...
For example look no further than Wikipedia's current diff engine.
You have mentioned some properties of the current diff engine, but I'm afraid I don't see how any of them are in any way a problem or an issue.
Timwi
On 10/06/06, Timwi timwi@gmx.net wrote:
Well, you see, that (the fact that PHP misses out on the most basic vital features, not just Unicode specifically) is kind of why no sensible 21st-century programmer would ever recommend PHP to start writing something new, and why the only people who choose or even recommend PHP are amateurs.
Got a penchant for trolling, eh?
Rob Church
Hmm, I don't think this thread is a good place for fighting language wars. Your POV is that PHP is a bad language; my POV is that PHP offers a reasonable trade-off between performance, standards support, cost, reliability, complexity and so on. However, I did some research, and it looks like the first prize in the category "Best support of Unicode in regular expressions" goes to Perl (Perl is cited as an example many times by Unicode.org). Unfortunately, PHP clearly sucks at the moment (even with the mbstring extension). Perhaps version 6 will change that. So it might make sense to rewrite the standalone component of a new diff engine in Perl.
Also, it looks like some people don't understand the punctuation issue. In the Unicode *standard*, punctuation marks can have codes below 0xc0 as well as *above* it. If you look at the code written by Tim Starling you'll see:

  // Punctuation and control characters
  if (ch < 0xc0) return false;

So basically the code above assumes that punctuation marks can only have codes below 0xc0, which is incorrect. On the other hand, if you type in MS Word a left single quotation mark, then a sequence of letters, then a right single quotation mark, only the sequence of letters will be spell-checked. Which is nice, and shows that the MS Word developers respect at least the Unicode standard. In other words, Word sees the difference between *all* Unicode punctuation marks and all Unicode letters. But you won't be able to repeat the same trick with MediaWiki: the current diff engine considers all punctuation marks with codes above 0xc0 to be letters and makes them part of a word.

Tim Starling says in his defence that high-numbered punctuation is rare and that processing it incorrectly won't do much damage. Well, to a certain extent it's a good defence, but if you accept it then you should also accept statements like "Opera is a rarely used browser, so if Wikipedia renders incorrectly in Opera it wouldn't do much damage" or "Supporting just IE and FF is sufficient". BTW, I noticed a few glitches with how Wikipedia is displayed in Opera. Probably I've drunk too much open source Kool-Aid, but here is a good example of a proprietary product (manufactured by the so-much-hated Microsoft) obeying standards and open source software that selectively supports standards. Someone suggested to me to fix it. Well, I'm afraid I'm more on the bug-creating side of things :) In fact, I was expecting that the "Unicode Nazis" would rush to fix it. Instead, all I got were "who cares" type responses. I guess I should add more water to my Kool-Aid next time…

Also, a small suggestion to all new participants in this thread: please state whether or not you like the feature in question (you can find a description in the original e-mail).
Rob Church wrote:
On 08/06/06, Timwi timwi@gmx.net wrote:
It is already confrontational of a programmer to pretend the whole world could make do with Latin-1. It is one of the most devastating and accordingly infuriating assumptions that still prevails despite the fact that Unicode is decades old. We're in the 21st century; it is no longer appropriate to even start programming anything where any user-visible text is restricted to Latin-1 or any other 8-bit charset.
Of course, of course, I clean forgot. Because a quick proof of concept has to be PERFECT, doesn't it. Do excuse that little oversight.
It's not perfect yet. Get over it and give some feedback on the idea.
I concur. If the feature is useful at all, a test version would be useful even if it only worked in a Cyrillic variant of EBCDIC.
However, responding to the original poster here (I'm pretty sure Rob agrees), the English Wikipedia is rarely if ever the right place to test proof-of-concept code. That'd be like testing an experimental engine design on the Autobahn during rush hour.
On Fri, Jun 09, 2006 at 12:50:00AM +0300, Ilmari Karonen wrote:
I concur. If the feature is useful at all, a test version would be useful even if it only worked in a Cyrillic variant of EBCDIC.
Pervert.
:-)
Cheers, -- jra
Roman Nosov wrote:
Hi all,
I've been directed here by Brion, Robchurch and others on #wikimedia-tech. So I propose a new feature for Wikipedia which people on #wikimedia-tech mostly refer to as a blame page or blame map. I would prefer to call it something like "Track contributions mode" (because of its similarity to MS Word's track changes mode) or "Hall of fame", but whatever. I have a live prototype written in PHP & MySQL at http://217.147.83.36:9000/ - an example of a "blame map" can be seen at http://217.147.83.36:9000/history::171 and two blame maps compared at http://217.147.83.36:9000/history::171=169
Wow! That's cool and useful. I only ask myself how it scales with large articles and dozens of editors. de:Benutzer:Jah / meta:user:Jah implemented a similar function for offline use a year ago,[1] but your script looks much more developed. With blame maps you can finally find out who is responsible for which statement and directly ask him to cite his sources :-) Can you add something to give a link to the normal diff/edit where a specific paragraph was changed for the last time?
Thank you and greetings, Jakob
[1] http://de.wikipedia.org/wiki/Wikipedia:Hauptautoren An example: http://de.wikipedia.org/wiki/Wikipedia:Hauptautoren/Stern
On Thu, Jun 29, 2006 at 10:28:12PM +0200, Jakob Voss wrote:
Wow! That's cool and useful. I only ask myself how it scales with large articles and dozens of editors. de:Benutzer:Jah / meta:user:Jah implemented a similar function for offline use a year ago,[1] but your script looks much more developed. With blame maps you can finally find out who is responsible for which statement and directly ask him to cite his sources :-) Can you add something to give a link to the normal diff/edit where a specific paragraph was changed for the last time?
I concur: that is *outtahand* cool. Is it possible to specify a cutoff, either in edits-back-from-now, or after-a-given-date, and leave older material uncolored? (Specifically, this might be really cool for patrolling... not necessarily on WP, mind you.)
Is this something that can be plugged into MediaWiki? Or was that just a mockup, and you're still early on?
Cheers, -- jra
On 30/06/06, Jay R. Ashworth jra@baylink.com wrote:
Is this something that can be plugged into MediaWiki? Or was that just a mockup, and you're still early on?
When it was shown to me, I was under the impression it was proof-of-concept code. With our current method of storing text, it wouldn't be a *simple* thing to add in, but it *would* be something that could be considered, if we could cache the diffs in some manner; otherwise this becomes a real performance hog.
It's worth noting that someone else has registered an intention to do something similar and provide it as a script on the toolserver. That sounds infeasible to me, however, since we all know that, right now (and likely, for some considerable time), Zedler does not have access to the text of pages.
Rob Church
On 6/30/06, Rob Church robchur@gmail.com wrote:
When it was shown to me, I was under the impression it was proof-of-concept code. With our current method of storing text, it wouldn't be a *simple* thing to add in, but it *would* be something that could be considered, if we could cache the diffs in some manner; otherwise this becomes a real performance hog.
When you say "performance hog", are you thinking in terms of the blame map being displayed every time someone clicks "history", or only when they click some specific button?
Steve
On 30/06/06, Steve Bennett stevage@gmail.com wrote:
When you say "performance hog", are you thinking in terms of the blame map being displayed every time someone clicks "history", or only when they click some specific button?
I'm thinking in terms of the algorithm having to refer to diffs of the page which won't exist unless cached from a previous operation. So it would have to generate them there and then. Diffs are cheap enough, sure, but keep pulling the text records and diffing against them en masse, and we'll notice the spike.
Rob Church
Hello,
a *simple* thing to add in, but it *would* be something that could be considered, if we could cache the diffs in some manner; otherwise this becomes a real performance hog.
Caching diffs ain't that fun. The interesting approach to this problem would be generating an intermediate diff that could be edited incrementally, revision after revision, then displayed. But still, it is eye candy that requires quite a lot of resources...
Domas
On 30/06/06, Domas Mituzas midom.lists@gmail.com wrote:
But still, it is eye candy that requires quite a lot of resources...
You call it eye candy, I call it an interesting and potentially damn useful feature idea. Running a web site that gets upwards of ten thousand odd hits per second, allowing any old web user to edit it, etc. etc. also requires resources.
How do we handle the hit rate? Caching. Squids, shared memory caching and opcode caching, not to mention encouraging the client side cache where possible. Our resources evolve to meet our needs. We can adapt to what our code base is doing.
Sure, unreasonable and needless performance drains are a bitch, and I'm all for eradicating them. But the emphasis is on unreasonable and needless.
I'm well aware of the issues that the innocuous phrase "caching diffs" brings up, hence the "...". But here we have an interesting proposal - it's newborn, it might not have the cleanest implementation, and it might turn out to be a load of bollocks. You won't know until we've examined it in more detail.
So performance is an issue. When is it not?
Rob Church
On 6/30/06, Rob Church robchur@gmail.com wrote:
On 30/06/06, Domas Mituzas midom.lists@gmail.com wrote:
But still, it is eye candy that requires quite a lot of resources...
You call it eye candy, I call it an interesting and potentially damn useful feature idea. Running a web site that gets upwards of ten thousand odd hits per second, allowing any old web user to edit it, etc. etc. also requires resources.
I think this blame map idea is a hell of a lot more than eye candy, and would be a massively useful addition to Wikipedia, well worth investing time and effort, and possibly new hardware on. It would really help dealing with vandalism, hoaxing, unsourced statements and probably lots of other problems we haven't even realised yet. Hell, it would even make it easier to attribute unsigned posts to talk pages.
So performance is an issue. When is it not?
I love the fact that we have server admins telling us not to worry too much about the performance side of things :)
Steve
On 30/06/06, Steve Bennett stevage@gmail.com wrote:
I love the fact that we have server admins telling us not to worry too much about the performance side of things :)
I'm not a server admin.
Rob Church
On 6/30/06, Rob Church robchur@gmail.com wrote:
On 30/06/06, Steve Bennett stevage@gmail.com wrote:
I love the fact that we have server admins telling us not to worry too much about the performance side of things :)
I'm not a server admin.
So much for that theory then :)
Steve
Hi!
So much for that theory then :)
hehe, when cluster load goes up, lots of strange things start to happen (well, of course less than a year ago, when it all went haywire), and frustration goes up thousand-fold :)
so when someone says "hey look, they're having lots of resources", converted to server-admin speak it might sound like "cluster going down, going down" :-)
Domas
Steve Bennett wrote:
I love the fact that we have server admins telling us not to worry too much about the performance side of things :)
We *should* worry about the performance side of things, but not let that fear impede the addition of useful functionality. I believe this is what Rob was actually saying.
On 30/06/06, Ivan Krstic krstic@fas.harvard.edu wrote:
We *should* worry about the performance side of things, but not let that fear impede the addition of useful functionality. I believe this is what Rob was actually saying.
Précisément.
Rob Church
On Fri, Jun 30, 2006 at 10:31:32AM +0200, Steve Bennett wrote:
I think this blame map idea is a hell of a lot more than eye candy, and would be a massively useful addition to Wikipedia, well worth investing time and effort, and possibly new hardware on. It would really help dealing with vandalism, hoaxing, unsourced statements and probably lots of other problems we haven't even realised yet. Hell, it would even make it easier to attribute unsigned posts to talk pages.
Oh, and tooltip popups, for the attributions?
Cheers, -- jra
Wooo, you type faster than you think!
You call it eye candy, I call it an interesting and potentially damn useful feature idea. Running a web site that gets upwards of ten thousand odd hits per second, allowing any old web user to edit it, etc. etc. also requires resources.
Yes, this is what we constantly work on =)
How do we handle the hit rate? Caching. Squids, shared memory caching and opcode caching, not to mention encouraging the client side cache where possible. Our resources evolve to meet our needs. We can adapt to what our code base is doing.
When did you work on performance? How many of the various eye candies can be cached? Squids can cache content for anonymous users; this is what they're doing at the moment. Shared memory caching isn't that much of a panacea; our usual MySQL queries are nearly as fast as memcached hits - both in the single-millisecond range. It is usually used to offload processing, but you have to define a clear case for how and when to use the cache.
Say, for the simple two-revision diff we not only cache the results, but also have a great C++ module for doing the actual work. You should note that we have zillion-revision articles, so not only do you run out of RGB (colours to highlight contributors with), but you also run up quite a lot of expense.
might turn out to be a load of bollocks. You won't know until we've examined it in more detail. So performance is an issue. When is it not?
Many things we did to improve performance were based on profiling data. And of course, common sense.
We had DifferenceEngine taking nearly 10% of our cluster CPU use, with less than 0.5% of backend requests being diffs. Now we're down to 2% of cluster CPU use, with less than 0.5% of backend requests being diffs. Add in the regular page loading routines, because there's more to showing a diff than DifferenceEngine.

So sure, lots of work has been done, and it became quite efficient.
Now if we introduce a task that is far more complex than a two-rev diff:

a) how much use will it get (towards creating an encyclopedia)?
b) how efficient would it be?
Of course, if someone comes up with an efficient feature, it is great. But a complex, efficient feature requires complex, efficient programming, and we have just two guys who are actually doing such work (and capable of doing it). And sure, they're both loaded with running the site. If anyone else wants to join the ship, of course, they're absolutely welcome.
And people shouldn't have the misconception that we have lots of resources and can run anything. We're a donation-powered website that has to fix its infrastructure before autumn, when we'll have yet another surge of users.
Anyone has the right to download the dump and provide nice eye candy for the community, if they want to participate that way :) Just one important thing to remember - don't try to get popular ;-)
Domas
On 30/06/06, Domas Mituzas midom.lists@gmail.com wrote:
When did you work on performance? How many of the various eye candies can be cached?
I didn't. My point was that the people best suited to do so shouldn't be put off by the fact that there might have to be a bit of extra work. God forbid that a programmer should express an opinion on a matter that's still mostly confined to the development side at this time.
Many things we did to improve performance were based on profiling data. And of course, common sense.
You analyse the problem, you think of the best way of solving it and you do it. Yes. Why won't that method work here?
a) how much of use that will get (to create an encyclopedia)
A lot, as has been echoed here already.
Of course, if someone comes up with an efficient feature, it is great. But a complex, efficient feature requires complex, efficient programming, and we have just two guys who are actually doing such work (and capable of doing it).
I won't go into details on this, but I suggest you avoid the backhanded insults to the rest of us. Who are not paid. Who do this because we give a shit. Why, I don't know. I couldn't possibly tell you what compels me to continue working in this environment, writing little features and bug fixes.
If anyone else wants to join the ship, of course, they're absolutely welcome.
You frequently give off a vastly different impression.
And people shouldn't have the misconception that we have lots of resources and can run anything. We're a donation-powered website that has to fix its infrastructure before autumn, when we'll have yet another surge of users.
I'm aware of this. I'm aware that we're running an Alexa top-15 web site on a budget that probably equals the average coffee expenditure of some of the other organisations in that ballpark.
I had no such misconceptions. I know how bad it is. I know that a dodgy query can send things sideways; I've *watched* the cluster die badly.
Anyone has the right to download the dump and provide nice eye candy for the community, if they want to participate that way :) Just one important thing to remember - don't try to get popular ;-)
Just one important thing to remember. Don't sacrifice utility for performance. Otherwise, we'll just have to tell Jimbo Wales his grand little scheme is far too unfeasible to run.
Are we intending on having a productive dialogue here, Domas?
Rob Church
Hello,
I didn't. My point was that the people best suited to do so shouldn't be put off by the fact that there might have to be a bit of extra work. God forbid that a programmer should express an opinion on a matter that's still mostly confined to the development side at this time.
There were a few guys complaining about how much work they invested in one feature or another, and why it still isn't running on the site. There has to be an understanding of what is needed for code to actually be used. If that understanding is there, sure, it's great, let's continue the development.
You analyse the problem, you think of the best way of solving it and you do it. Yes. Why won't that method work here?
Exactly, this is why my initial mail on this issue was just a ramble about incremental diffs. I haven't received any opinions on them yet :) There can be various ideas on how to deal with it: diff just the last 3-5 revisions and make it a rolling diff.
Sure, diffs are quite critical in wikis - they add to collaboration, and they must be there. But the cost of a diff shouldn't be neglected either ;-)
I won't go into details on this, but I suggest you avoid the backhanded insults to the rest of us. Who are not paid. Who do this because we give a shit. Why, I don't know. I couldn't possibly tell you what compels me to continue working in this environment, writing little features and bug fixes.
It was not an insult. I know you have a problem and you think the whole world is against you. Well, maybe it is, but that is off-topic. I'm not paid either. I also work on little changes and bug fixes. And I wouldn't take the fact that we have two hard-working guys writing fascinating stuff as an insult. Because I'd like to be able to do that much too. Maybe I can't, so I do what I can.
You frequently give off a vastly different impression.
:-(
Just one important thing to remember. Don't sacrifice utility for performance. Otherwise, we'll just have to tell Jimbo Wales his grand little scheme is far too unfeasible to run.
We're not sacrificing. We have a few other features that have been impacting our performance for years, and we keep them there. Because they make sense :) And yet another issue: our performance is our utility. Make it slow, and people will get frustrated and won't enjoy the free encyclopedia.
Are we intending on having a productive dialogue here, Domas?
When weren't we? ;-)
Domas
On 30/06/06, Domas Mituzas midom.lists@gmail.com wrote:
It was not an insult. I know you have a problem and you think the whole world is against you. Well, maybe it is, but that is off-topic.
That statement is completely unacceptable and I would like an apology.
I'm not paid either. I also work on little changes and bug fixes. And I wouldn't take the fact that we have two hard-working guys writing fascinating stuff as an insult. Because I'd like to be able to do
It was the phrasing of it that implied Brion and Tim were our only competent coders, I think.
that much too. Maybe I can't, so I do what I can.
Same here.
We're not sacrificing. We have a few other features that have been impacting our performance for years, and we keep them there. Because they make sense :)
So could this. Although I will concede it wouldn't be critical.
And yet another issue: our performance is our utility. Make it slow, and people will get frustrated and won't enjoy the free encyclopedia.
That's a very valid and fair point.
When weren't we? ;-)
All right, let's scrap the pettiness and move on.
Rob Church
On Fri, Jun 30, 2006 at 10:36:45AM +0100, Rob Church wrote:
On 30/06/06, Domas Mituzas midom.lists@gmail.com wrote:
It was not an insult. I know you have a problem and you think the whole world is against you. Well, maybe it is, but that is off-topic.
That statement is completely unacceptable and I would like an apology.
<referee> Domas, I've been watching this list for the last year, and Rob is the *last* person about whom I, personally, would have formed that impression. I have to agree with him; you might want to think about why you think that, and why you said it. </referee>
Cheers, -- jra
Domas Mituzas wrote:
Exactly, this is why my initial mail on this issue was just a ramble about incremental diffs. I haven't received any opinions on them yet :) There can be various ideas on how to deal with it: diff just the last 3-5 revisions and make it a rolling diff.
Incidentally, I've recently been looking into this extensively, using a Wikipedia dataset. I have a bunch of hard numbers; I'll clean them up and post them here in the next few days.
On 6/30/06, Rob Church robchur@gmail.com wrote:
On 30/06/06, Domas Mituzas midom.lists@gmail.com wrote:
But still, it is eye candy that requires quite a lot of resources...
You call it eye candy, I call it an interesting and potentially damn useful feature idea. Running a web site that gets upwards of ten thousand odd hits per second, allowing any old web user to edit it, etc. etc. also requires resources.
I don't think of this as anything remotely like eye candy. In fact it will save resources compared with manually hunting down who is responsible for a particular change when it is not recent - something I already do. Of course, with a handy tool for this, people will use it several orders of magnitude more often than they currently do manually.
Andrew Dunbar (hippietrail)
How do we handle the hit rate? Caching. Squids, shared memory caching and opcode caching, not to mention encouraging the client side cache where possible. Our resources evolve to meet our needs. We can adapt to what our code base is doing.
Sure, unreasonable and needless performance drains are a bitch, and I'm all for eradicating them. But the emphasis is on unreasonable and needless.
I'm well aware of the issues that the innocuous phrase "caching diffs" brings up, hence the "...". But here we have an interesting proposal - it's newborn, it might not have the cleanest implementation, and it might turn out to be a load of bollocks. You won't know until we've examined it in more detail.
So performance is an issue. When is it not?
Rob Church
On Fri, Jun 30, 2006 at 09:23:22AM +0100, Rob Church wrote:
So performance is an issue. When is it not?
And, not to put too fine a point on it, while performance considerations need to be taken into account for WMF sites, not all MWs are WMF sites.
It would be interesting to have even a 3-sigma estimate of how many *hits*, total, go to WMF MediaWikiae vs MWs run by others...
Cheers, -- jra
Rob Church wrote:
It's worth noting that someone else has registered an intention to do something similar and provide it as a script on the toolserver. That sounds infeasible to me, however, since we all know that, right now (and likely, for some considerable time), Zedler does not have access to the text of pages.
Well, my idea was simpler. While this new system would be superb at giving attribution, and an example of GFDL compliance, in everyday use you won't need to know 'who added each piece', but rather 'who added that quote' on the Jimbo Wales user page (~3000 edits). Off-topic: interesting how Jimbo's IP was hidden [http://en.wikipedia.org/w/index.php?title=User:Jimbo_Wales&diff=7476787&...]
So you would ask 'when was this phrase [at revision y] introduced?', getting:
a) You must be wrong, the article doesn't have such a phrase
b) Three edits before, by Foo [diff]
I don't think it would be very expensive. I reckon most changes will have been introduced within the last 10 edits (4 edits or so on average). Of course you should then try to guess whether it was a blanking plus reversion, which complicates things a little more ;)
The worst case would be a page that has a lot of edits where the token has been there for a lot of them. The first time it will be slow, but on subsequent runs the cache will speed it up a lot, so you'll have to go looking for another big article.
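(A rough PHP sketch of the walk-back search described above. The function name and the $fetchText/$parentOf callbacks are invented for illustration, and a real version would also need the blanking/reversion handling just mentioned.)

<?php
// Hypothetical sketch: find which edit introduced $phrase, starting from
// revision $startRev and walking backwards through its parents.
// $fetchText($revId) is assumed to return the raw wikitext of a revision,
// and $parentOf($revId) the id of the previous revision (null at the first).
function findIntroducingRevision($phrase, $startRev, callable $fetchText, callable $parentOf)
{
    if (strpos($fetchText($startRev), $phrase) === false) {
        return null;           // case (a): the phrase isn't in revision y at all
    }
    $rev = $startRev;
    while (($parent = $parentOf($rev)) !== null) {
        if (strpos($fetchText($parent), $phrase) === false) {
            return $rev;       // case (b): $rev is the edit that added the phrase
        }
        $rev = $parent;        // phrase already present, keep walking back
    }
    return $rev;               // present since the very first revision
}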
Your problem is that the toolserver (zedler) doesn't have direct access to stored text due to external storage (which we were told will be solved... one of these years), and text queries need to hit the pmtpa server. Yeah, this can reduce performance, but:
- we only want raw text; no need to _stress_ Wikimedia servers with page rendering
- if a revision was queried before, it will be cached on the toolserver :-)
- <s>we</s> I expect to need few revisions per query (as stated above)
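(For illustration, a tiny sketch of that kind of cached raw-text fetch, which could serve as the $fetchText callback in the sketch above. The cache directory and helper name are made up; action=raw with oldid is a standard MediaWiki query, but error handling is omitted here.)

<?php
// Hypothetical toolserver-side helper: fetch the raw wikitext of one revision
// and keep a copy on local disk, so the Wikimedia servers are only hit once
// per revision. The cache directory and function name are invented.
function fetchRawRevision($revId, $cacheDir = '/tmp/revcache')
{
    $cacheFile = "$cacheDir/$revId.txt";
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile);    // fetched earlier, reuse it
    }
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    // action=raw returns plain wikitext, so no page rendering on the far end.
    $url = 'http://en.wikipedia.org/w/index.php?action=raw&oldid=' . (int)$revId;
    $text = file_get_contents($url);
    file_put_contents($cacheFile, $text);
    return $text;
}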
We could watch under real conditions which problems arise, and whether it's necessary to reduce the frequency of queries or set a wait time. Until then, everything is speculation :-)
As a side note, there's currently a script on the toolserver querying every change on Wikipedia. Asking for a few revisions when a user calls the PHP script shouldn't matter in comparison ;-) External storage is a toolserver problem, so I don't know why you complain to me so much about it. Moreover, there are worse ways of implementing it - what about doing it in JavaScript? hahaha.
Platonides
On 6/30/06, Platonides Platonides@gmail.com wrote:
Well, my idea was simpler. While this new system would be superb at giving attribution, and an example of GFDL compliance, in everyday use you won't need to know 'who added each piece', but rather 'who added that quote' on the Jimbo Wales user page (~3000 edits). Off-topic: interesting how Jimbo's IP was hidden
[snip]
The proposed system would be completely inadequate for attribution purposes - too easy to confuse the computer even in obvious cases... plus not all cases are obvious; someone can still hold a significant copyright interest in the work even if every word they wrote has been replaced.
(Not that I don't think it would be a useful tool, but don't advertise it as something that it is not :) )
On 01/07/06, Platonides Platonides@gmail.com wrote:
As a side note, there's currently a script on the toolserver querying every change on Wikipedia. Asking for a few revisions when a user calls the PHP script shouldn't matter in comparison ;-)
Whose?
External storage is a toolserver problem, so I don't know why you complain to me so much about it. Moreover, there are worse ways of implementing it - what about doing it in JavaScript? hahaha.
I'm not complaining, so don't misrepresent me as doing so. I'm pointing out that a strong reliance upon text access is not too brilliant for a project running on the toolserver at present.
Rob Church
On 6/30/06, Platonides Platonides@gmail.com wrote: [snip]
As a side note, there's currently a script on the toolserver querying every change on Wikipedia. Asking for a few revisions when a user calls the PHP script shouldn't matter in comparison ;-) External storage is a toolserver problem, so I don't know why you complain to me so much about it. Moreover, there are worse ways of implementing it - what about doing it in JavaScript? hahaha.
There is a huge difference between reading every article change (less than 1 per second on a one month average) and reading all of the 9000ish revisions to George W. Bush every time someone wants to view the blame map.
Hi,
On Saturday 01 July 2006 18:00, Gregory Maxwell wrote:
On 6/30/06, Platonides Platonides@gmail.com wrote: [snip]
As a side note, there's currently a script on the toolserver querying every change on Wikipedia. Asking for a few revisions when a user calls the PHP script shouldn't matter in comparison ;-) External storage is a toolserver problem, so I don't know why you complain to me so much about it. Moreover, there are worse ways of implementing it - what about doing it in JavaScript? hahaha.
There is a huge difference between reading every article change (less than 1 per second on a one month average) and reading all of the 9000ish revisions to George W. Bush every time someone wants to view the blame map.
Cache the blame map. In addition, cache it for each revision. Limit the cache to "N maps or M megabytes, whichever is reached first".
I think it should be possible to generate the blame-map for revision N+1 from the map of revision N and the diff between the revisions.
Best wishes,
Tels
On 7/1/06, Tels nospam-abuse@bloodgate.com wrote:
Cache the blame map. In addition, cache it for each revision. Limit the cache to "N maps or M megabytes, whichever is reached first".
I think it should be possible to generate the blame-map for revision N+1 from the map of revision N and the diff between the revisions.
Locality is poor; any time you talk about caching revisions you're fighting a losing battle.
We'd really need incremental production of blame maps... where you can take a finished blame map for revisions 1..5, add revisions 6 and 7, and get the 1..7 blame map. Then blame maps could simply be generated and stored... and when they are requested, it would only require fetching the map and updating it.
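(A minimal sketch of that incremental update, assuming the blame map is stored as a flat list of word/author pairs and that a word-level diff of revisions N and N+1 is already available as keep/delete/insert operations. Both representations are made up for this sketch, not how MediaWiki stores anything today.)

<?php
// Hypothetical incremental update: given the blame map covering revisions
// 1..N as a flat list of ['word' => ..., 'author' => ...] entries, and a
// word-level diff from revision N to N+1 expressed as keep/delete/insert
// operations, produce the blame map covering revisions 1..N+1.
function applyDiffToBlameMap(array $blameMap, array $diffOps, $newAuthor)
{
    $result = [];
    $pos = 0;                               // cursor into the old blame map
    foreach ($diffOps as $op) {
        switch ($op[0]) {
            case 'keep':                    // unchanged words keep their author
                for ($i = 0; $i < $op[1]; $i++) {
                    $result[] = $blameMap[$pos++];
                }
                break;
            case 'delete':                  // removed words drop out of the map
                $pos += $op[1];
                break;
            case 'insert':                  // new words are credited to the new editor
                foreach ($op[1] as $word) {
                    $result[] = ['word' => $word, 'author' => $newAuthor];
                }
                break;
        }
    }
    return $result;                         // store this; the next edit starts from here
}

// Example diff of the latest edit:
// [['keep', 3], ['delete', 1], ['insert', ['new', 'sentence']]]

The appeal is exactly what's described above: each new revision only costs one diff against its predecessor, so nobody ever has to replay the 9000-revision history of George W. Bush.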