Re: [Wikitech-l] Re: Differencing Engine

14 Sep 2005


      On Wed, Sep 14, 2005 at 11:04:17PM +1000, Tim Starling wrote:
...
The LCS algorithm operates on arrays, it's called an array of lines
internally but it could be pretty much anything, all you need is a valid
comparison operator. The formatter first does a line level diff, to
identify paragraphs which have been moved or deleted. It identifies
which lines have been changed, for some definition of changed. Then it
splits each of the changed lines into words (or characters, in the case
of Chinese and Japanese), and does a diff on the word arrays.
An important exception which I implemented a couple of weeks ago, is to
impose a maximum line length. Lines above the maximum are treated as a
single word. Some kinds of vandalism involve very long lines (~1 MB),
and these diffs used to fail. Brion ported this change to the C++
module, since both versions suffered from the same problem. Except the
C++ version segfaulted whereas the PHP version just gave an out of
memory error.
It would be nice to have a second exception for a large number of lines
requiring word-level diffs: this case also causes the current PHP
algorithm to fail.
I'm doing character-level diffs in my bot, for deep reverts,
and I found that printing two revisions to files, one character per line
(escaped of course), running GNU diff, and parsing the result,
is extremely fast.
It took Algorithm::Diff about 7 minutes to do what GNU diff did
in less than 1 second.
So GNU diff must be doing the right thing that we should simply copy,
instead of recoding Algorithm::Diff in C++.

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] Re: Differencing Engine