Jakob Voss wrote:
Hi!
Can anybody tell my what differencing algorithm MediaWiki uses? DifferenceEngine.php starts with "see diff.doc" - a file I cannot find. As far as I know it has been taken from phpwiki but is it still the same algorithm and what algorithm does phpwiki exactely use? I have not found any better references on differencing algorithms but the papers listed at http://en.wikipedia.org/wiki/Diff#References. There seems to be some new research in differencing XML-files. Is MedaWiki's differencing algorithm based on E. Myers's algorithm (1986)?
MediaWiki's diff code was mostly written by Geoffrey Dairiki, for PhpWiki. It was imported into MediaWiki by Magnus and has since been modified by a few MediaWiki developers. It's a lowest-common subsequence (LCS) algorithm, and is "mostly lifted" from a Perl module by Ned Konz, now called Algorithm::DiffOld. It's called DiffOld because the new Algorithm::Diff is faster. Maybe if someone's feeling ambitious, they could port the new perl module. I was wondering if those string comparisons were the slow part, and it seems the author of the new Algorithm::Diff package found they were.
There is indeed a C++ version, but we don't use it. Brion had a bit of trouble installing it, and based on the performance statistics, it's not urgent. We'll get it working eventually.
-- Tim Starling