On Tue, Sep 13, 2005 at 11:44:41AM +0200, Jakob Voss wrote:
MediaWiki's
diff code was mostly written by Geoffrey Dairiki, for
PhpWiki. It was imported into MediaWiki by Magnus and has since been
modified by a few MediaWiki developers. It's a longest common subsequence
(LCS) algorithm, and is "mostly lifted" from a Perl module by Ned Konz,
now called Algorithm::DiffOld. It's called DiffOld because the new
Algorithm::Diff is faster. Maybe if someone's feeling ambitious, they
could port the new Perl module. I was wondering if those string
comparisons were the slow part, and it seems the author of the new
Algorithm::Diff package found they were.
As far as I can see, Algorithm::Diff uses the classic longest common
subsequence algorithm by Dan Hirschberg, which runs in O(n*m) time, where
n and m are the lengths of the two texts, and in linear space:
http://www.ics.uci.edu/~eppstein/161/960229.html
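For readers unfamiliar with the algorithm family being discussed, here is a minimal sketch of the classic O(n*m)-time LCS dynamic program, in Python rather than the Perl or PHP actually used by Algorithm::Diff and MediaWiki. It keeps only two table rows, so memory stays linear; Hirschberg's full algorithm additionally recovers the subsequence itself in linear space, which this sketch does not attempt.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b.

    Classic O(len(a) * len(b)) dynamic program; only two DP rows are
    kept at a time, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                # Characters match: extend the best subsequence that
                # ended just before both positions.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the better of skipping x or y.
                curr.append(max(prev[j], curr[-1]))
        prev = curr
    return prev[-1]
```

The quadratic table is exactly why input granularity matters so much for large pages, as the field report below illustrates.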
Quick field report:
I tried to run Algorithm::Diff on the split("", $str)'ed contents of two
revisions, and for bigger (around 30 kB) pages it takes a few minutes.
This is weird.
How does MediaWiki do it? Does it try to find the longest common
subsequence of lines or of characters?
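The slowdown reported above is consistent with character granularity: splitting a 30 kB page into characters makes n and m about 30,000 each, so the O(n*m) table approaches a billion cells, whereas splitting into lines keeps n and m in the hundreds. A quick illustration using Python's standard difflib (not the Perl module under discussion) shows line-granularity diffing:

```python
import difflib

# Two hypothetical page revisions; real wiki pages would be far larger.
old = "first line\nsecond line\nthird line\n"
new = "first line\nsecond line changed\nthird line\n"

# Diff at line granularity: n and m are line counts, so the quadratic
# comparison stays cheap even for large pages. Splitting the same text
# with list(old) instead would diff character-by-character.
diff = list(difflib.unified_diff(old.splitlines(keepends=True),
                                 new.splitlines(keepends=True),
                                 fromfile="rev1", tofile="rev2"))
print("".join(diff), end="")
```

Diff tools commonly apply the same idea the question hints at: diff lines first, then optionally re-diff only the changed lines at word or character granularity.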