When a new paragraph is inserted, diff doesn't detect that the previous first paragraph is now the second, so it reports much larger changes than actually happened. Why is that? How can it be fixed?
I'm talking about Wikipedia now. Are there different implementations of diff in various instances of MediaWiki? How is it implemented? Using UNIX/Linux diff, wdiff, or some other algorithm?
Here is an example, where a bullet list of works (discography) was enhanced, http://sv.wikipedia.org/w/index.php?title=Staffan_M%C3%A5rtensson&diff=1...
As you can see, the Brahms Clarinet Sonatas were pushed from 1st to 2nd position, but diff reports this as a total change. Instead, the record label (Channel Sound) is reported as unchanged text. Yes, the phrase "med Erik Lanninger" was also changed to "Med E Lanninger", but that is a much smaller change than the one reported.
At my website runeberg.org, where scanned books are proofread, I have implemented the diff function using wdiff with some extra features. An example is shown here, http://runeberg.org/rc.pl?action=diff&src=nfbf/0734
Since a common edit is to change "word" to "<b>word</b>", I want changes in XML-like markup to be reported separately, which you can see is the case at the bottom of that diff. But wdiff looks strictly at whitespace, so I had to modify this. The quite naive and non-optimized (but working) Perl code looks like this (yes, versions are maintained by plain old RCS):
# A change from "foo bar" to "<b>foo bar" is seen by wdiff as a
# change of the word "foo" into "<b>foo". But we want to see this
# as the addition of the HTML/XML tag "<b>". To this effect, we
# pad spaces around all "<" and ">" in the original text versions,
# i.e. " <b> foo bar" before calling wdiff. The output from wdiff
# will be " <span><b></span> foo bar", where the padding spaces
# are outside of the <span> tags. This has to be taken into
# consideration when removing the space padding, below.

# Check out both RCS revisions, pad the tags, and pipe wdiff's output back.
my $cmd = "umask 2"
    . " && co -p1.$rev1 $filename 2>/dev/null | sed 's/</ </g;s/>/> /g' >$tmp1"
    . " && co -p1.$rev2 $filename 2>/dev/null | sed 's/</ </g;s/>/> /g' >$tmp2"
    . " && wdiff -n -s -w '<span class=\"del\">' -x '</span>' "
    . " -y '<span class=\"ins\">' -z '</span>' $tmp1 $tmp2 |";
if (open(FILE, $cmd)) {
    local $/ = undef;    # slurp the whole wdiff output at once
    $diff = <FILE>;
    close(FILE);
} else {
    debug_log("rc.pl: Failed with $cmd");
}
$diff = html_encode($diff);
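The padding step described in the comment can be tried on its own, without RCS or wdiff. This is only a minimal shell sketch, not the actual code from rc.pl: the function name pad_tags is hypothetical, and only the sed expression is taken from the command above.

```shell
#!/bin/sh
# pad_tags (hypothetical helper name): put a space before every "<" and
# after every ">", so that a whitespace-based word diff such as wdiff
# sees each tag as a separate word.
pad_tags() {
    sed 's/</ </g;s/>/> /g'
}

old=$(printf '%s\n' 'foo bar' | pad_tags)
new=$(printf '%s\n' '<b>foo bar' | pad_tags)

# Brackets make the added padding spaces visible.
printf '[%s]\n' "$old"    # prints [foo bar]
printf '[%s]\n' "$new"    # prints [ <b> foo bar]
```

After padding, the two versions differ only by the extra word "<b>", which is what lets wdiff report the tag as an insertion instead of a change of "foo" into "<b>foo".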
Hope this was helpful.