On Wed, Sep 14, 2005 at 11:04:17PM +1000, Tim Starling wrote:
The LCS algorithm operates on arrays, it's called an array of lines internally but it could be pretty much anything, all you need is a valid comparison operator. The formatter first does a line level diff, to identify paragraphs which have been moved or deleted. It identifies which lines have been changed, for some definition of changed. Then it splits each of the changed lines into words (or characters, in the case of Chinese and Japanese), and does a diff on the word arrays.
An important exception which I implemented a couple of weeks ago, is to impose a maximum line length. Lines above the maximum are treated as a single word. Some kinds of vandalism involve very long lines (~1 MB), and these diffs used to fail. Brion ported this change to the C++ module, since both versions suffered from the same problem. Except the C++ version segfaulted whereas the PHP version just gave an out of memory error.
It would be nice to have a second exception for a large number of lines requiring word-level diffs: this case also causes the current PHP algorithm to fail.
I'm doing character-level diffs in my bot, for deep reverts, and I found that printing two revisions to files, one character per line (escaped of course), running GNU diff, and parsing the result, is extremely fast.
It took Algorithm::Diff about 7 minutes to do what GNU diff did in less than 1 second.
So GNU diff must be doing the right thing that we should simply copy, instead of recoding Algorithm::Diff in C++.