When a new paragraph is inserted, diff doesn't discover that the
former first paragraph is now the second, so it reports much
larger changes than actually happened. Why is that? How can it be
fixed?
I'm talking about Wikipedia now. Are there different
implementations of diff in various instances of MediaWiki?
How is it implemented? Using UNIX/Linux diff, wdiff, or some other
algorithm?
Here is an example, where a bullet list of works (discography) was
enhanced,
http://sv.wikipedia.org/w/index.php?title=Staffan_M%C3%A5rtensson&diff=…
As you can see, the Brahms Clarinet Sonatas were pushed from 1st to
2nd position, but this is reported by diff as a total change.
Instead, the record label (Channel Sound) is reported as unchanged
text.
Yes, the phrase "med Erik Lanninger" was also changed to "Med E
Lanninger", but that is a much smaller change than the one
reported.
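For comparison, here is what a generic line-based sequence matcher
does with a list when one item is inserted at the top. This is only
a sketch using Python's difflib, which is neither MediaWiki's diff
code nor UNIX diff, and the list items are invented placeholders,
not the text of the article above. The helper name changed_lines is
made up for this example.

```python
import difflib

def changed_lines(old, new):
    """Return only the lines the matcher marks as removed (-) or added (+)."""
    changes = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            changes += [("-", line) for line in old[i1:i2]]
        if tag in ("insert", "replace"):
            changes += [("+", line) for line in new[j1:j2]]
    return changes

# Placeholder list items (assumption: one list item per line).
old = ["* First work (Label X)", "* Second work (Label X)"]

# Case 1: a new item is inserted, the others are untouched.
new_1 = ["* New work (Label X)"] + old
print(changed_lines(old, new_1))

# Case 2: a new item is inserted AND the old first item is edited.
new_2 = ["* New work (Label X)",
         "* First work, revised (Label X)",
         "* Second work (Label X)"]
print(changed_lines(old, new_2))
```

In case 1 only the new line is reported. In case 2 the old first
line no longer matches any line exactly, so it is reported as
removed and both new lines as added. Whether something like this
plays a part in the Wikipedia example is a guess on my side; it
would only apply if the moved item was also edited on the same line.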
At my website runeberg.org, where scanned books are proofread,
I have implemented the diff function using wdiff with some
extra features. An example is shown here,
http://runeberg.org/rc.pl?action=diff&src=nfbf/0734
Since a common edit is to change "word" to "<b>word</b>", I want
changes in XML-like markup to be reported separately, which you
can see is the case at the bottom of that diff. But wdiff splits
words strictly at whitespace, so I had to work around this. The
quite naive and non-optimized (but working) Perl code looks like
this (yes, versions are maintained by plain old RCS):
# A change from "foo bar" to "<b>foo bar" is seen by wdiff as a
# change of the word "foo" into "<b>foo". But we want to see this
# as the addition of the HTML/XML tag "<b>". To this effect, we
# pad spaces around all "<" and ">" in the original text versions,
# i.e. " <b> foo bar" before calling wdiff. The output from wdiff
# will be " <span><b></span> foo bar", where the padding spaces
# are outside of the <span> tags. This has to be taken into
# consideration when removing the space padding, below.
my $cmd = "umask 2"
    . " && co -p1.$rev1 $filename 2>/dev/null | sed 's/</ </g;s/>/> /g' >$tmp1"
    . " && co -p1.$rev2 $filename 2>/dev/null | sed 's/</ </g;s/>/> /g' >$tmp2"
    . " && wdiff -n -s -w '<span class=\"del\">' -x '</span>' "
    . " -y '<span class=\"ins\">' -z '</span>' $tmp1 $tmp2 |";
if (open(FILE, $cmd)) {
    local $/ = undef;
    $diff = <FILE>;
    close(FILE);
} else {
    debug_log("rc.pl: Failed with $cmd");
}
$diff = html_encode($diff);
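To illustrate the padding idea in isolation, here is a minimal
sketch in Python rather than the Perl and wdiff used above. It uses
the standard difflib module as a stand-in for wdiff, and the names
pad_tags and word_diff are made up for this example. It compares
whitespace-split words only and throws away the original spacing,
so it does not need the padding-removal step that real wdiff output
requires.

```python
import difflib

def pad_tags(text):
    """Put a space before every '<' and after every '>' so that
    each tag becomes a separate whitespace-delimited word."""
    return text.replace("<", " <").replace(">", "> ")

def word_diff(old, new, pad=True):
    """Return a list of (op, words) pairs; op is 'same', 'del' or 'ins'."""
    if pad:
        old, new = pad_tags(old), pad_tags(new)
    a, b = old.split(), new.split()
    result = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.append(("same", a[i1:i2]))
            continue
        if i2 > i1:  # words only in the old version
            result.append(("del", a[i1:i2]))
        if j2 > j1:  # words only in the new version
            result.append(("ins", b[j1:j2]))
    return result

# With padding, only the tag is reported as inserted.
print(word_diff("foo bar", "<b>foo bar"))

# Without padding, the whole word "foo" appears to change.
print(word_diff("foo bar", "<b>foo bar", pad=False))
```

With padding, the edit shows up as the insertion of "<b>" alone.
Without it, "foo" is reported as replaced by "<b>foo", which is the
behaviour described in the comment above.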
Hope this was helpful.
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se