On Sat, Mar 22, 2008 at 6:28 PM, Guy Van den Broeck <guyvdb(a)gmail.com>
wrote:
I want to get some feedback on a possible Summer of Code project proposal.
For last year's GSoC I created an HTML diffing library for Daisy CMS. The
algorithm has proven to work well and I'm thinking of porting it to
MediaWiki.
What the algorithm does is take the source of 2 pages and merge them to
visualize the diff. The code I have already does something like this:
http://users.pandora.be/guyvdb/wikipediadiff.jpg
Is this a feasible project for Wikimedia? I'm personally not very impressed
with the current "diff pages". I think a visual diff would bring that part
of MediaWiki up to par with the rest of the software.
I agree that inline diffs would be nicer, instead of side-by-side.
Having an HTML-rendered diff instead of a wikitext diff is useful
to some extent, but it hides information. It seems like it would be
relatively difficult to convey the fact that templates or images were
changed, for instance, and things like comments (which must be
included in diffs for proper usability) would also be an issue. Some
mechanism would have to be devised to convey that such invisible
changes took place. Possibly you could have an option to do a
wikitext diff instead, but that doesn't seem ideal to me. Doing it
one way that works well for everyone would be best if possible.
As for performance, please note that Wikimedia uses a diff engine
written in C++. One written in PHP would probably not be acceptable
on Wikipedia, from past experience (diffing used to eat a huge amount
of CPU). Scalability is also important, within reason: [[George W.
Bush]] is 128 KiB, for instance.
Note that the image overlays are probably wrong on Safari, but in principle
it works for images.
Templates and, for instance, table changes are handled too. In Daisy we chose
to display a tooltip window with an interpretation of the underlying HTML
changes. I'm sure we can find something similar tailored for the needs of
mediawiki.
If I start working on the HTML diff then I might as well add a word-for-word
source diff like I did for Daisy:
It suffers from the same performance penalty as the HTML diff but it conveys
all information present.
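To make the idea of a word-for-word source diff concrete, here is a minimal
sketch in Python. It is not the Daisy code: it uses the standard difflib
module rather than my algorithm, and the function name word_diff and the
<del>/<ins> output format are assumptions chosen only for illustration.

```python
import difflib
import html
import re


def word_diff(old: str, new: str) -> str:
    """Merge two source texts, marking word-level changes inline."""
    # Tokenize into words and whitespace runs so spacing is preserved.
    old_tokens = re.findall(r"\s+|\S+", old)
    new_tokens = re.findall(r"\s+|\S+", new)
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)

    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Escape the source so wikitext or HTML markup is shown, not rendered.
        removed = html.escape("".join(old_tokens[i1:i2]))
        added = html.escape("".join(new_tokens[j1:j2]))
        if tag == "equal":
            out.append(added)
            continue
        if tag in ("delete", "replace"):
            out.append(f"<del>{removed}</del>")
        if tag in ("insert", "replace"):
            out.append(f"<ins>{added}</ins>")
    return "".join(out)
```

For example, word_diff("the quick fox", "the slow fox") returns
"the <del>quick</del><ins>slow</ins> fox". Because it works on the source
text, changes to templates or comments show up like any other word.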
With respect to performance I think there are a lot of options. We can fall
back on a simpler diff when the file size or execution time exceeds a certain
number, or the HTML diff can be an extra (experimental) link on the current
diff page.
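The fallback idea could look roughly like the sketch below. Everything in it
is an assumption for illustration: the names diff_with_fallback,
simple_line_diff and check_budget, the cutoff values, and the convention that
the expensive diff calls check_budget() periodically so it can be abandoned.

```python
import difflib
import time

MAX_BYTES = 256 * 1024  # assumed size cutoff, not a measured value
MAX_SECONDS = 2.0       # assumed time budget, not a measured value


def simple_line_diff(old, new):
    """Cheap line-based diff used as the fallback."""
    return list(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=""))


def diff_with_fallback(old, new, detailed_diff,
                       max_bytes=MAX_BYTES, max_seconds=MAX_SECONDS):
    """Try the expensive diff; fall back when the input is too large or too slow.

    detailed_diff(old, new, check_budget) is a hypothetical callable that is
    expected to call check_budget() from time to time while it works.
    """
    size = len(old.encode("utf-8")) + len(new.encode("utf-8"))
    if size > max_bytes:
        return "simple", simple_line_diff(old, new)

    deadline = time.monotonic() + max_seconds

    def check_budget():
        # Raised inside the detailed diff once the time budget is used up.
        if time.monotonic() > deadline:
            raise TimeoutError("detailed diff exceeded its time budget")

    try:
        return "detailed", detailed_diff(old, new, check_budget)
    except TimeoutError:
        return "simple", simple_line_diff(old, new)
```

The first element of the result says which diff was actually produced, so the
page can tell the reader when it has fallen back to the simpler view.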
In general, I don't think the performance concern should hold back this
project. Once we have the optimized HTML diff code we can decide how and
when to integrate it.