Hi!
Can anybody tell me what differencing algorithm MediaWiki uses? DifferenceEngine.php starts with "see diff.doc" - a file I cannot find. As far as I know it has been taken from phpwiki, but is it still the same algorithm, and what algorithm exactly does phpwiki use? I have not found any better references on differencing algorithms than the papers listed at http://en.wikipedia.org/wiki/Diff#References. There seems to be some new research on differencing XML files. Is MediaWiki's differencing algorithm based on E. Myers's algorithm (1986)?
Thanks and greetings, Jakob
Timwi wrote:
It uses the standard Linux tool "diff", doesn't it?
No, since per-character instead of per-line granularity is needed. MediaWiki ships its own PHP diff implementation. A tremendously faster C++ implementation is available in CVS along with a SWIG interface, in the module 'wikidiff', which is in use at Wikimedia sites.
Jakob, I imagine the C++ code would be the quickest way to see how the diff is done. It's been a while since I looked at it, so I don't recall the algorithm used anymore.
-IK
Ivan Krstic wrote:
Timwi wrote:
It uses the standard Linux tool "diff", doesn't it?
No, since per-character instead of per-line granularity is needed.
Those are not mutually exclusive. I have done such a thing before, using the standard Linux tool "diff".
MediaWiki ships its own PHP diff implementation. A tremendously faster C++ implementation is available in CVS along with a SWIG interface, in the module 'wikidiff', which is in use at Wikimedia sites.
Alright, I didn't know that.
Timwi
Jakob Voss wrote:
Hi!
Can anybody tell me what differencing algorithm MediaWiki uses? DifferenceEngine.php starts with "see diff.doc" - a file I cannot find. As far as I know it has been taken from phpwiki, but is it still the same algorithm, and what algorithm exactly does phpwiki use? I have not found any better references on differencing algorithms than the papers listed at http://en.wikipedia.org/wiki/Diff#References. There seems to be some new research on differencing XML files. Is MediaWiki's differencing algorithm based on E. Myers's algorithm (1986)?
MediaWiki's diff code was mostly written by Geoffrey Dairiki, for PhpWiki. It was imported into MediaWiki by Magnus and has since been modified by a few MediaWiki developers. It's a longest common subsequence (LCS) algorithm, and is "mostly lifted" from a Perl module by Ned Konz, now called Algorithm::DiffOld. It's called DiffOld because the new Algorithm::Diff is faster. Maybe if someone's feeling ambitious, they could port the new Perl module. I was wondering whether those string comparisons were the slow part, and it seems the author of the new Algorithm::Diff package found they were.
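(For anyone curious what that looks like in practice, here is a minimal sketch of a textbook O(n*m) LCS-based diff in PHP. It only illustrates the general approach; it is not the actual PhpWiki/MediaWiki code, and the function name and output format are made up for the example.)

<?php
// Minimal, illustrative LCS-based diff over two arrays of tokens.
// Builds the classic O(n*m) dynamic-programming table, then walks it
// backwards to emit a copy/delete/insert edit script.
function lcsDiff( array $from, array $to ): array {
    $n = count( $from );
    $m = count( $to );

    // $len[$i][$j] = length of the LCS of $from[0..$i) and $to[0..$j)
    $len = array_fill( 0, $n + 1, array_fill( 0, $m + 1, 0 ) );
    for ( $i = 1; $i <= $n; $i++ ) {
        for ( $j = 1; $j <= $m; $j++ ) {
            $len[$i][$j] = ( $from[$i - 1] === $to[$j - 1] )
                ? $len[$i - 1][$j - 1] + 1
                : max( $len[$i - 1][$j], $len[$i][$j - 1] );
        }
    }

    // Walk the table backwards to recover the edit operations.
    $ops = [];
    $i = $n;
    $j = $m;
    while ( $i > 0 || $j > 0 ) {
        if ( $i > 0 && $j > 0 && $from[$i - 1] === $to[$j - 1] ) {
            $ops[] = [ 'copy', $from[--$i] ];
            $j--;
        } elseif ( $j > 0 && ( $i === 0 || $len[$i][$j - 1] >= $len[$i - 1][$j] ) ) {
            $ops[] = [ 'insert', $to[--$j] ];
        } else {
            $ops[] = [ 'delete', $from[--$i] ];
        }
    }
    return array_reverse( $ops );
}

// Example: a word-level diff of two short lines.
print_r( lcsDiff(
    explode( ' ', 'the quick brown fox' ),
    explode( ' ', 'the quick red fox jumps' )
) );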
There is indeed a C++ version, but we don't use it. Brion had a bit of trouble installing it, and based on the performance statistics, it's not urgent. We'll get it working eventually.
-- Tim Starling
Tim Starling wrote:
MediaWiki's diff code was mostly written by Geoffrey Dairiki, for PhpWiki. It was imported into MediaWiki by Magnus and has since been modified by a few MediaWiki developers. It's a longest common subsequence (LCS) algorithm, and is "mostly lifted" from a Perl module by Ned Konz, now called Algorithm::DiffOld. It's called DiffOld because the new Algorithm::Diff is faster. Maybe if someone's feeling ambitious, they could port the new Perl module. I was wondering whether those string comparisons were the slow part, and it seems the author of the new Algorithm::Diff package found they were.
As far as I can see, Algorithm::Diff uses the classic longest common subsequence algorithm by Dan Hirschberg, which runs in O(n*m) time, where n and m are the lengths of the two texts, and in linear space: http://www.ics.uci.edu/~eppstein/161/960229.html
There is also an algorithm by Eugene W. Myers that runs in O((n+m)*D) time, where D is the size of the difference between the texts: http://citeseer.ist.psu.edu/myers86ond.html
Davide Libenzi implemented it in C: http://www.xmailserver.org/xdiff-lib.html
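(To give an idea of the core of Myers's approach, the following sketch computes just D, the size of the shortest edit script, using his greedy diagonal search. This is an illustration in PHP, not Libenzi's C code or GNU diff, and recovering the actual edit script takes additional bookkeeping, e.g. the linear-space refinement from the paper.)

<?php
// Sketch of the heart of Myers's O((N+M)*D) algorithm: compute D, the
// length of the shortest edit script between two token arrays.
function myersEditDistance( array $a, array $b ): int {
    $n = count( $a );
    $m = count( $b );
    $max = $n + $m;

    // $v[$k + $max] = furthest x reached so far on diagonal k (k = x - y).
    $v = array_fill( 0, 2 * $max + 2, 0 );

    for ( $d = 0; $d <= $max; $d++ ) {
        for ( $k = -$d; $k <= $d; $k += 2 ) {
            // Decide whether to step down (an insertion) or right (a deletion).
            if ( $k === -$d || ( $k !== $d && $v[$k - 1 + $max] < $v[$k + 1 + $max] ) ) {
                $x = $v[$k + 1 + $max];
            } else {
                $x = $v[$k - 1 + $max] + 1;
            }
            $y = $x - $k;
            // Follow the "snake": free diagonal moves while tokens match.
            while ( $x < $n && $y < $m && $a[$x] === $b[$y] ) {
                $x++;
                $y++;
            }
            $v[$k + $max] = $x;
            if ( $x >= $n && $y >= $m ) {
                return $d;
            }
        }
    }
    return $max; // never reached
}

// Two texts that differ by a few words give a small D regardless of length.
echo myersEditDistance(
    explode( ' ', 'a b c d e f g' ),
    explode( ' ', 'a b x d e y g' )
), "\n"; // prints 4 (delete c, insert x, delete f, insert y)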
Maybe diffs are not the most important performance issue, but surely they can be sped up.
Greetings, Jakob
Jakob Voss wrote:
Maybe diffs are not the most important performance issue, but surely they can be sped up.
Yes, it would certainly be a useful project if anyone wants to work on it. The gains would be comparable to those of any other performance project on this scale. The C++ module could do with some optimisation, and the PHP equivalent will continue to be used by external users even if not by Wikimedia. And they would be useful projects for standardisation and submission to PECL and PEAR, IMHO.
-- Tim Starling
On Tue, Sep 13, 2005 at 11:44:41AM +0200, Jakob Voss wrote:
MediaWiki's diff code was mostly written by Geoffrey Dairiki, for PhpWiki. It was imported into MediaWiki by Magnus and has since been modified by a few MediaWiki developers. It's a longest common subsequence (LCS) algorithm, and is "mostly lifted" from a Perl module by Ned Konz, now called Algorithm::DiffOld. It's called DiffOld because the new Algorithm::Diff is faster. Maybe if someone's feeling ambitious, they could port the new Perl module. I was wondering whether those string comparisons were the slow part, and it seems the author of the new Algorithm::Diff package found they were.
As far as I can see, Algorithm::Diff uses the classic longest common subsequence algorithm by Dan Hirschberg, which runs in O(n*m) time, where n and m are the lengths of the two texts, and in linear space: http://www.ics.uci.edu/~eppstein/161/960229.html
Quick field report:
I tried to use Algorithm::Diff on the split("",$str)'ed contents of two revisions, and for bigger pages (around 30 kB) it takes a few minutes. This is weird.
How is MediaWiki doing it? Does it try to find the longest common subsequence of lines or of characters?
I tried to use Algorithm::Diff on the split("",$str)'ed contents of two revisions, and for bigger pages (around 30 kB) it takes a few minutes. This is weird.
How is MediaWiki doing it? Does it try to find the longest common subsequence of lines or of characters?
Well, MediaWiki certainly doesn't consider each character individually; it works on whole words. I don't know if that's the whole explanation, but seeing as the algorithm is O(n*m), it would make some sense.
Tomasz Wegrzanowski wrote:
Quick field report:
I tried to use Algorithm::Diff on the split("",$str)'ed contents of two revisions, and for bigger pages (around 30 kB) it takes a few minutes. This is weird.
How is MediaWiki doing it? Does it try to find the longest common subsequence of lines or of characters?
The LCS algorithm operates on arrays. Internally the input is called an array of lines, but it could be pretty much anything; all you need is a valid comparison operator. The formatter first does a line-level diff to identify paragraphs which have been moved or deleted, and to identify which lines have been changed, for some definition of changed. Then it splits each of the changed lines into words (or characters, in the case of Chinese and Japanese) and does a diff on the word arrays.
An important exception, which I implemented a couple of weeks ago, is to impose a maximum line length. Lines above the maximum are treated as a single word. Some kinds of vandalism involve very long lines (~1 MB), and those diffs used to fail. Brion ported this change to the C++ module, since both versions suffered from the same problem, except that the C++ version segfaulted whereas the PHP version just gave an out-of-memory error.
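(Very roughly, the word-splitting step with that exception looks something like the following sketch. It is not the real DifferenceEngine code; the cap value, the function name and the splitting regex are invented for illustration.)

<?php
// Sketch of splitting a changed line into words, with the long-line
// exception described above. Invented names and cap; not MediaWiki code.
const MAX_LINE_LENGTH = 10000;

function splitChangedLine( string $line ): array {
    // Over-long lines (e.g. ~1 MB of vandalism) are treated as a single
    // "word", so the word-level diff cannot blow up in time or memory.
    if ( strlen( $line ) > MAX_LINE_LENGTH ) {
        return [ $line ];
    }
    // Otherwise split at word boundaries, keeping trailing whitespace
    // attached so the line can be reassembled verbatim. For scripts
    // written without spaces between words (Chinese, Japanese) you would
    // instead split into single characters, e.g. with
    // preg_split( '//u', $line, -1, PREG_SPLIT_NO_EMPTY ).
    return preg_split( '/(?<=\s)(?=\S)/u', $line );
}

// The line-level pass marks which lines changed; each changed old/new
// line pair is then diffed word by word, e.g. with an LCS diff over
// splitChangedLine( $oldLine ) and splitChangedLine( $newLine ).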
It would be nice to have a second exception for a large number of lines requiring word-level diffs: this case also causes the current PHP algorithm to fail.
-- Tim Starling
On Wed, Sep 14, 2005 at 11:04:17PM +1000, Tim Starling wrote:
The LCS algorithm operates on arrays. Internally the input is called an array of lines, but it could be pretty much anything; all you need is a valid comparison operator. The formatter first does a line-level diff to identify paragraphs which have been moved or deleted, and to identify which lines have been changed, for some definition of changed. Then it splits each of the changed lines into words (or characters, in the case of Chinese and Japanese) and does a diff on the word arrays.
An important exception, which I implemented a couple of weeks ago, is to impose a maximum line length. Lines above the maximum are treated as a single word. Some kinds of vandalism involve very long lines (~1 MB), and those diffs used to fail. Brion ported this change to the C++ module, since both versions suffered from the same problem, except that the C++ version segfaulted whereas the PHP version just gave an out-of-memory error.
It would be nice to have a second exception for a large number of lines requiring word-level diffs: this case also causes the current PHP algorithm to fail.
I'm doing character-level diffs in my bot, for deep reverts, and I found that printing two revisions to files, one character per line (escaped, of course), running GNU diff, and parsing the result is extremely fast.
It took Algorithm::Diff about 7 minutes to do what GNU diff did in less than 1 second.
So GNU diff must be doing something right that we should simply copy, instead of recoding Algorithm::Diff in C++.
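(Concretely, the same trick looks something like this as a rough PHP sketch. It assumes GNU diff is on the PATH, and the escaping scheme here, rawurlencode, is just one possible choice.)

<?php
// Write each revision with one (escaped) character per line, run GNU diff
// on the two temporary files, and return its output.
function charLevelDiff( string $old, string $new ): string {
    $writeChars = function ( string $text ): string {
        $path = tempnam( sys_get_temp_dir(), 'rev' );
        $out = '';
        foreach ( preg_split( '//u', $text, -1, PREG_SPLIT_NO_EMPTY ) as $ch ) {
            // Escape each character so newlines etc. stay on one line.
            $out .= rawurlencode( $ch ) . "\n";
        }
        file_put_contents( $path, $out );
        return $path;
    };

    $a = $writeChars( $old );
    $b = $writeChars( $new );

    // GNU diff exits with status 1 when the inputs differ, so just
    // collect whatever it prints and ignore the exit code.
    exec( 'diff ' . escapeshellarg( $a ) . ' ' . escapeshellarg( $b ), $lines );

    unlink( $a );
    unlink( $b );

    // Each hunk now refers to character positions; rawurldecode() the
    // payload when turning this back into an edit script.
    return implode( "\n", $lines );
}

echo charLevelDiff( "colour\n", "color\n" ), "\n";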
Tomasz Wegrzanowski wrote:
I'm doing character-level diffs in my bot, for deep reverts, and I found that printing two revisions to files, one character per line (escaped, of course), running GNU diff, and parsing the result is extremely fast.
It took Algorithm::Diff about 7 minutes to do what GNU diff did in less than 1 second.
So GNU diff must be doing something right that we should simply copy, instead of recoding Algorithm::Diff in C++.
I had a look at the GNU diffutils documentation. Currently GNU diff is based on Myers's O((n+m)*D) algorithm. I think Algorithm::Diff uses the classic O(n*m) algorithm.
Greetings, Jakob