Tomasz Wegrzanowski wrote:
Quick field report:
I tried to use Algorithm::Diff on the split("", $str)'ed contents of two revisions, and for bigger pages (around 30 kB) it takes a few minutes. This is weird.
How does MediaWiki do it? Does it try to find the longest common subsequence of lines or of characters?
The LCS algorithm operates on arrays. Internally it's called an array of lines, but it could be pretty much anything; all you need is a valid comparison operator. The formatter first does a line-level diff to identify paragraphs which have been moved or deleted, and to work out which lines have been changed, for some definition of changed. Then it splits each of the changed lines into words (or characters, in the case of Chinese and Japanese) and does a second diff on the word arrays.
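
To make the shape of that concrete, here is a rough PHP sketch of the two-pass idea. It is not the actual DairikiDiff code: the real implementation uses a far more efficient LCS than this naive O(n*m) table, and the function names are made up. The structure is the point: diff by lines first, then re-diff only the changed lines word by word.

<?php
// Sketch only, not MediaWiki's DairikiDiff. Naive O(n*m) LCS table;
// the real code uses a much more efficient algorithm.
function lcsDiff( array $a, array $b ): array {
	$n = count( $a );
	$m = count( $b );
	$len = array_fill( 0, $n + 1, array_fill( 0, $m + 1, 0 ) );
	for ( $i = $n - 1; $i >= 0; $i-- ) {
		for ( $j = $m - 1; $j >= 0; $j-- ) {
			$len[$i][$j] = ( $a[$i] === $b[$j] )
				? $len[$i + 1][$j + 1] + 1
				: max( $len[$i + 1][$j], $len[$i][$j + 1] );
		}
	}
	// Walk the table and emit ('=', '-', '+') operations.
	$ops = [];
	$i = $j = 0;
	while ( $i < $n && $j < $m ) {
		if ( $a[$i] === $b[$j] ) {
			$ops[] = [ '=', $a[$i] ]; $i++; $j++;
		} elseif ( $len[$i + 1][$j] >= $len[$i][$j + 1] ) {
			$ops[] = [ '-', $a[$i] ]; $i++;
		} else {
			$ops[] = [ '+', $b[$j] ]; $j++;
		}
	}
	while ( $i < $n ) { $ops[] = [ '-', $a[$i++] ]; }
	while ( $j < $m ) { $ops[] = [ '+', $b[$j++] ]; }
	return $ops;
}

// Pass 1: line-level diff. Pass 2: word-level diff of each changed pair.
function twoLevelDiff( string $old, string $new ): array {
	$lineOps = lcsDiff( explode( "\n", $old ), explode( "\n", $new ) );
	$out = [];
	$count = count( $lineOps );
	for ( $k = 0; $k < $count; $k++ ) {
		[ $op, $line ] = $lineOps[$k];
		if ( $op === '-' && $k + 1 < $count && $lineOps[$k + 1][0] === '+' ) {
			// A deleted line followed by an inserted line is treated as a
			// changed line: diff it again, word by word.
			$out[] = [ 'changed', lcsDiff(
				preg_split( '/(?<=\s)/', $line ),
				preg_split( '/(?<=\s)/', $lineOps[$k + 1][1] )
			) ];
			$k++; // skip the paired '+' line
		} else {
			$out[] = [ $op, $line ];
		}
	}
	return $out;
}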
An important exception, which I implemented a couple of weeks ago, is to impose a maximum line length: lines above the maximum are treated as a single word. Some kinds of vandalism involve very long lines (around 1 MB), and diffs of those pages used to fail. Brion ported this change to the C++ module, since both versions suffered from the same problem, except that the C++ version segfaulted whereas the PHP version just gave an out-of-memory error.
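
In terms of the sketch above, that exception lives in the word-splitting step, roughly like this (the constant name and threshold are illustrative, not the real MediaWiki values):

<?php
// Sketch of the long-line exception; names and threshold are invented.
const MAX_DIFF_LINE = 10000; // bytes; hypothetical value

function splitLineIntoWords( string $line ): array {
	if ( strlen( $line ) > MAX_DIFF_LINE ) {
		// Over-long lines (e.g. ~1 MB vandalism) become a single opaque
		// "word", so the word-level LCS never sees a huge token array.
		return [ $line ];
	}
	// Otherwise split after each run of whitespace, keeping it attached.
	return preg_split( '/(?<=\s)/', $line );
}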
It would be nice to have a second exception for edits where a large number of lines require word-level diffs, since that case also causes the current PHP algorithm to fail; one possible shape for it is sketched below.
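
Purely as a sketch, with an invented threshold and no claim that this is how it would actually be tuned, the guard could count the changed line pairs before the word-level pass and fall back to plain line-level output when there are too many:

<?php
// Hypothetical guard, not existing MediaWiki code: skip the word-level
// pass entirely when an edit touches too many lines.
const MAX_WORD_DIFF_LINES = 1000; // invented threshold

function shouldDoWordLevelDiff( int $changedLinePairs ): bool {
	return $changedLinePairs <= MAX_WORD_DIFF_LINES;
}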
-- Tim Starling