On Tue, Sep 13, 2005 at 11:44:41AM +0200, Jakob Voss wrote:
MediaWiki's diff code was mostly written by Geoffrey Dairiki for PhpWiki. It was imported into MediaWiki by Magnus and has since been modified by a few MediaWiki developers. It's a longest common subsequence (LCS) algorithm, "mostly lifted" from a Perl module by Ned Konz, now called Algorithm::DiffOld. It's called DiffOld because the new Algorithm::Diff is faster. Maybe if someone's feeling ambitious, they could port the new Perl module. I was wondering whether those string comparisons were the slow part, and it seems the author of the new Algorithm::Diff package found that they were.
As far as I could see, Algorithm::Diff uses the classic longest common subsequence algorithm by Dan Hirschberg, which runs in O(n*m) time, where n and m are the lengths of the two texts, and uses only linear space: http://www.ics.uci.edu/~eppstein/161/960229.html
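To make the time and space claims concrete, here is a minimal, self-contained Perl sketch (my own code, not taken from Algorithm::Diff) of the length-only LCS computation that Hirschberg's algorithm builds on. It keeps just two rows of the n*m dynamic-programming table, so memory stays linear; the full algorithm then recurses on this routine to recover the subsequence itself, still in linear space:

  use strict;
  use warnings;

  # Length-only LCS in O(n*m) time and O(m) space: only two rows of
  # the DP table are kept, the building block of Hirschberg's algorithm.
  sub lcs_length {
      my ( $xs, $ys ) = @_;             # array refs of elements
      my @prev = (0) x ( @$ys + 1 );    # DP row for the previous prefix of @$xs
      for my $i ( 1 .. @$xs ) {
          my @cur = (0);                # column 0 is always 0
          for my $j ( 1 .. @$ys ) {
              $cur[$j] =
                  $xs->[ $i - 1 ] eq $ys->[ $j - 1 ]
                ? $prev[ $j - 1 ] + 1
                : ( $prev[$j] > $cur[ $j - 1 ] ? $prev[$j] : $cur[ $j - 1 ] );
          }
          @prev = @cur;
      }
      return $prev[-1];
  }

  # The LCS of "ABCBDAB" and "BDCABA" has length 4 (e.g. "BCBA").
  print lcs_length( [ split //, "ABCBDAB" ], [ split //, "BDCABA" ] ), "\n";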
Quick field report:
I tried running Algorithm::Diff on the split("",$str)'ed contents of two revisions, and for bigger pages (around 30 kB) it takes a few minutes. This is weird.
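For scale: splitting a 30 kB page into characters gives roughly 30,000 elements per side, so if the underlying algorithm really is O(n*m) as described above, it has on the order of 10^9 index pairs to consider, which would explain minutes of runtime; splitting on newlines keeps the sequences a few hundred elements long. A test along those lines presumably looks something like the sketch below (the file slurping and variable names are mine, not from the actual test):

  use strict;
  use warnings;
  use Algorithm::Diff qw(diff);

  # Slurp the two revision files named on the command line.
  my ( $old_rev, $new_rev ) = map {
      open my $fh, '<', $_ or die "$_: $!";
      local $/;                 # slurp mode
      scalar <$fh>;
  } @ARGV;

  # Character-level: ~30,000 elements per side for a 30 kB page (slow).
  my @char_hunks = diff( [ split //, $old_rev ], [ split //, $new_rev ] );

  # Line-level: typically a few hundred elements per side (fast).
  my @line_hunks = diff( [ split /\n/, $old_rev ], [ split /\n/, $new_rev ] );

  printf "%d char-level hunks, %d line-level hunks\n",
      scalar @char_hunks, scalar @line_hunks;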
How is MediaWiki doing it? Does it try to find the longest common subsequence of lines or of characters?