Tomasz Wegrzanowski wrote:
Quick field report:
I tried to use Algorithm::Diff on the split("",$str)'ed contents of 2
revisions, and for bigger (~30 kB) pages it takes a few minutes. This is
weird. How is MediaWiki doing it? Does it try to find the longest common
subsequence of lines or of characters?
The LCS algorithm operates on arrays; internally it's called an array of
lines, but it could be pretty much anything, since all you need is a
valid comparison operator. The formatter first does a line-level diff to
identify paragraphs which have been moved or deleted, and to identify
which lines have changed, for some definition of changed. Then it splits
each changed line into words (or characters, in the case of Chinese and
Japanese) and does a diff on the word arrays.
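The two-pass scheme above can be sketched as follows. This is a hedged illustration in Python using the standard-library difflib, not MediaWiki's actual PHP/C++ code; the function name `two_pass_diff` and the whitespace-based word split are assumptions for the example.

```python
import difflib
import re

def two_pass_diff(old_text, new_text):
    """Illustrative two-pass diff: line-level first, then word-level
    on lines that were changed (sketch, not MediaWiki's algorithm)."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    line_sm = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    result = []
    for tag, i1, i2, j1, j2 in line_sm.get_opcodes():
        if tag == 'replace':
            # Changed lines: split each pair into words and diff those arrays.
            for old_line, new_line in zip(old_lines[i1:i2], new_lines[j1:j2]):
                old_words = re.findall(r'\S+', old_line)
                new_words = re.findall(r'\S+', new_line)
                word_sm = difflib.SequenceMatcher(a=old_words, b=new_words)
                result.append(('changed', word_sm.get_opcodes()))
        elif tag != 'equal':
            # Pure insertions/deletions stay at line granularity.
            result.append((tag, old_lines[i1:i2], new_lines[j1:j2]))
    return result
```

The point of the second pass is that a single edited word inside a long paragraph shows up as a word-level change rather than the whole line being marked as replaced.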
An important exception, which I implemented a couple of weeks ago, is a
maximum line length: lines above the maximum are treated as a single
word. Some kinds of vandalism involve very long lines (~1 MB), and diffs
on those used to fail. Brion ported this change to the C++ module, since
both versions suffered from the same problem, except that the C++
version segfaulted whereas the PHP version just gave an out-of-memory
error.
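The long-line exception amounts to a guard in the tokenizer. A minimal sketch, again in Python rather than the project's PHP/C++; the threshold value and the name `tokenize_line` are illustrative assumptions, not MediaWiki's actual constants:

```python
def tokenize_line(line, max_word_diff_len=10000):
    """Split a line into words for word-level diffing, but treat
    overlong lines as a single token so the word diff's cost stays
    bounded (sketch; threshold is an assumed value)."""
    if len(line) > max_word_diff_len:
        # One "word": no word-level diff is attempted inside this line.
        return [line]
    return line.split()
```

With this guard, a ~1 MB vandalism line contributes one array element to the word diff instead of hundreds of thousands, which is what kept the quadratic LCS cost from blowing up.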
It would be nice to have a second exception for a large number of lines
requiring word-level diffs: that case also causes the current PHP
algorithm to fail.
-- Tim Starling