When a new paragraph is inserted, diff doesn't notice that what used to be the first paragraph is now the second, so it reports much larger changes than actually happened. Why is that? How can it be fixed?
I'm talking about Wikipedia now. Are there different implementations of diff in various instances of MediaWiki? How is it implemented? Using UNIX/Linux diff, wdiff, or some other algorithm?
Here is an example, where a bullet list of works (discography) was enhanced, http://sv.wikipedia.org/w/index.php?title=Staffan_M%C3%A5rtensson&diff=1...
As you can see, the Brahms Clarinet Sonatas were pushed from 1st to 2nd position, but diff reports this as a total change. Meanwhile, the record label (Channel Sound) is reported as unchanged text. Yes, the phrase "med Erik Lanninger" was also changed to "Med E Lanninger", but that is a much smaller change than the one reported.
At my website runeberg.org, where scanned books are proofread, I have implemented the diff function using wdiff with some extra features. An example is shown here, http://runeberg.org/rc.pl?action=diff&src=nfbf/0734
Since a common edit is to change "word" to "<b>word</b>", I want changes in XML-like markup to be reported separately, which you can see is the case at the bottom of that diff. But wdiff looks strictly at whitespace, so I had to modify this. The quite naive and non-optimized (but working) Perl code looks like this (yes, versions are maintained by plain old RCS):
# A change from "foo bar" to "<b>foo bar" is seen by wdiff as a
# change of the word "foo" into "<b>foo". But we want to see this
# as the addition of the HTML/XML tag "<b>". To this effect, we
# pad spaces around all "<" and ">" in the original text versions,
# i.e. " <b> foo bar" before calling wdiff. The output from wdiff
# will be " <span><b></span> foo bar", where the padding spaces
# are outside of the <span> tags. This has to be taken into
# consideration when removing the space padding, below.

# Check out both RCS revisions, pad the angle brackets with sed,
# and let wdiff wrap deletions and insertions in <span> tags.
my $cmd = "umask 2"
    . " && co -p1.$rev1 $filename 2>/dev/null | sed 's/</ </g;s/>/> /g' >$tmp1"
    . " && co -p1.$rev2 $filename 2>/dev/null | sed 's/</ </g;s/>/> /g' >$tmp2"
    . " && wdiff -n -s -w '<span class=\"del\">' -x '</span>' "
    . " -y '<span class=\"ins\">' -z '</span>' $tmp1 $tmp2 |";
if (open(FILE, $cmd)) {
    local $/ = undef;    # slurp the whole wdiff output at once
    $diff = <FILE>;
    close(FILE);
} else {
    debug_log("rc.pl: Failed with $cmd");
}
$diff = html_encode($diff);
Hope this was helpful.
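
For readers who want to try the tag-padding idea without RCS or wdiff, here is a minimal sketch in Python (not the Perl used on runeberg.org). It assumes that splitting on whitespace after padding "<" and ">" is a fair stand-in for what wdiff sees, and it uses difflib's SequenceMatcher in place of wdiff's own comparison. The names tokenize and word_changes are made up for this illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Pad '<' and '>' with spaces, then split on whitespace, so that
    every markup tag becomes a token of its own (hypothetical helper)."""
    padded = re.sub(r"<", " <", text)    # same effect as sed 's/</ </g'
    padded = re.sub(r">", "> ", padded)  # same effect as sed 's/>/> /g'
    return padded.split()


def word_changes(old, new):
    """Return (operation, old_tokens, new_tokens) for each changed region.

    SequenceMatcher stands in for wdiff here; this is an assumption of
    the sketch, not how wdiff is implemented.
    """
    a, b = tokenize(old), tokenize(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


if __name__ == "__main__":
    for change in word_changes("a word here", "a <b>word</b> here"):
        print(change)
```

With the padding in place, changing "word" to "<b>word</b>" shows up as two inserted tags rather than as one replaced word, which is the behaviour described in the comment block above.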
On Wed, Nov 18, 2009 at 12:42 PM, Lars Aronsson lars@aronsson.se wrote:
> I'm talking about Wikipedia now. Are there different implementations of diff in various instances of MediaWiki? How is it implemented? Using UNIX/Linux diff, wdiff, or some other algorithm?
It's implemented out of the box in PHP, with a PHP extension written in C++ available for better speed.
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/diff/
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/wikidiff2/
wdiff isn't reliably available -- it's not installed by default on all (any?) Linux distros, and it's very unlikely to be installed on non-Linux servers. Moreover, even where it's installed, shared hosts often don't give PHP scripts the right to execute external programs -- that breaks out of PHP's sandboxes, and many shared hosts rely on those instead of Unix permissions.
Given all that, we need a PHP implementation of some kind. And once we have that, and need a faster version in C++ or such, I guess the logic goes that we may as well use the same algorithm for the sake of consistency. I don't know for sure, though: wikidiff2 was written several months before I started MediaWiki development.
On Wednesday 18 November 2009 19:59:58, Aryeh Gregor wrote:
> wdiff isn't reliably available -- it's not installed by default on all (any?) Linux distros, and it's very unlikely to be installed on non-Linux servers. Moreover, even where it's installed, shared hosts often don't give PHP scripts the right to execute external programs -- that breaks out of PHP's sandboxes, and many shared hosts rely on those instead of Unix permissions.
From what I recall I've seen while browsing through its source, wdiff just transforms every space in a file into a newline, then runs diff on it. This could be simulated in PHP too.
On Wed, Nov 18, 2009 at 2:22 PM, Nikola Smolenski smolensk@eunet.rs wrote:
> From what I recall I've seen while browsing through its source, wdiff just transforms every space in a file into a newline, then runs diff on it. This could be simulated in PHP too.
diff isn't reliably available either. It won't be present on Windows, and will often be inaccessible on Unix (because of exec() being disabled or such).
On Wednesday 18 November 2009 20:19:45, Aryeh Gregor wrote:
> diff isn't reliably available either. It won't be present on Windows, and will often be inaccessible on Unix (because of exec() being disabled or such).
But the reformatted text could be diffed the same way ordinary text is now.
On Wed, Nov 18, 2009 at 2:28 PM, Nikola Smolenski smolensk@eunet.rs wrote:
> But the reformatted text could be diffed the same way ordinary text is now.
Yes, of course we can change the diff algorithm if we want. It's in our SVN repo. That doesn't really have to do with diff or wdiff, though.
On Wednesday 18 November 2009 20:25:17, Aryeh Gregor wrote:
> Yes, of course we can change the diff algorithm if we want. It's in our SVN repo. That doesn't really have to do with diff or wdiff, though.
We don't need to change the diff algorithm, we could simply preformat the text the same way wdiff does.
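
The preformatting idea discussed in this sub-thread can be sketched as follows. This is an illustration in Python rather than PHP, and it takes the description of wdiff given above (one word per line, then an ordinary line diff) at face value. line_diff is a hypothetical stand-in for whatever line-based diff is already in place; here it is built on difflib's SequenceMatcher, which is not the algorithm MediaWiki uses.

```python
from difflib import SequenceMatcher


def line_diff(old_lines, new_lines):
    """Stand-in for an existing line-based diff (hypothetical).

    Returns a list of (marker, line) pairs, where marker is '-' for a
    removed line, '+' for an added line and ' ' for an unchanged line.
    """
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    result = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            result.extend((" ", line) for line in old_lines[i1:i2])
        else:
            result.extend(("-", line) for line in old_lines[i1:i2])
            result.extend(("+", line) for line in new_lines[j1:j2])
    return result


def one_word_per_line(text):
    """Preformat the text: every whitespace-separated word becomes a 'line'."""
    return text.split()


def word_diff(old, new):
    """Word-level diff obtained by feeding preformatted text to line_diff."""
    return line_diff(one_word_per_line(old), one_word_per_line(new))


if __name__ == "__main__":
    for marker, word in word_diff("med Erik Lanninger", "Med E Lanninger"):
        print(marker, word)
```

The diff routine itself is untouched; only its input changes. One thing this sketch leaves out is that the original line breaks are lost in the preformatting step, so a real implementation would need some way to restore them when rendering the result.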
Lars Aronsson wrote:
> When a new paragraph was inserted, diff doesn't discover that the previous first paragraph is now the second. The diff reports much larger changes than actually happened. Why is that? How can it be fixed?
This is one of those ancient bugs: https://bugzilla.wikimedia.org/show_bug.cgi?id=5072
> I'm talking about Wikipedia now. Are there different implementations of diff in various instances of MediaWiki? How is it implemented? Using UNIX/Linux diff, wdiff, or some other algorithm?
MediaWiki seems to use its own PHP diff, called "DifferenceEngine" (includes/diff/DifferenceEngine.php; the same directory also contains Diff.php, which includes a class "WikiDiff3"). However, it is possible to use other diff engines, such as GNU Diff/Diff3.
The config file of Wikimedia's setup suggests that Wikipedia is using the wikidiff2 engine:
http://noc.wikimedia.org/conf/highlight.php?file=CommonSettings.php
http://www.mediawiki.org/wiki/Extension:Wikidiff2
Regards,
Church of emacs
wikitech-l@lists.wikimedia.org