On Tue, Oct 21, 2008 at 12:33 AM, Nikola Smolenski <smolensk(a)eunet.yu> wrote:
On Saturday 18 October 2008 14:57:59 Daniel Kinzler wrote:
So, what would it take? Where could we try it?
What are the concerns?
FWIW, copying my email to M. Schneider:
IIRC, at Wikimania you talked about the problem of how to identify the primary
authors of articles, so I wanted to share my thoughts on this.
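The obvious first step is to go through all the revisions and compute the MD5
of each; then use the MD5s to isolate and disregard edits that have been
reverted (a revision whose MD5 matches an earlier revision is a revert, and
everything in between has been undone).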
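A minimal sketch of that first step, assuming the revisions are already available as a list of strings in chronological order. The function name surviving_revisions and the rule "a revision that matches an earlier one undoes everything in between" are my own reading of the idea, not something taken from an existing tool:

```python
import hashlib

def surviving_revisions(texts):
    """Return indices of revisions that were not undone by a later revert.

    Assumption: a revision whose MD5 equals that of an earlier surviving
    revision is a revert; it and everything after the restored revision
    are disregarded.
    """
    digests = [hashlib.md5(t.encode("utf-8")).hexdigest() for t in texts]
    first_seen = {}  # digest -> index of the surviving revision with that text
    kept = []        # indices of revisions currently considered live
    for i, digest in enumerate(digests):
        if digest in first_seen:
            # Revision i restores an earlier state: drop what came after it.
            target = first_seen[digest]
            while kept and kept[-1] > target:
                dropped = kept.pop()
                first_seen.pop(digests[dropped], None)
        else:
            first_seen[digest] = i
            kept.append(i)
    return kept

history = ["a", "a b", "VANDALISM", "a b", "a b c"]
print(surviving_revisions(history))
```

Here revision 2 is vandalism and revision 3 restores revision 1, so only revisions 0, 1 and 4 would be passed on to the diff stage.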
To measure the difference between two edits, I mentioned to you that wdiff
(http://www.gnu.org/software/wdiff/) could be used: simply count the number of
changed words in the article. Wdiff could give false positives (an author
who merely switches two paragraphs will appear to be a major author), but
could not give false negatives (an author who changes a single word really
did just change a single word; of course, such a change may be very
important, but it isn't major or, IMO, copyrightable).
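To make the counting concrete, here is a rough sketch that uses Python's difflib as a stand-in for wdiff (it is not wdiff itself, and its matching may differ in detail). The function name changed_words and the choice to count a replaced run as the longer of its two sides are assumptions of mine:

```python
import difflib

def changed_words(old, new):
    """Count the words that differ between two revisions, word by word."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # insert, delete or replace: count the longer side of the change
            changed += max(i2 - i1, j2 - j1)
    return changed

print(changed_words("the quick brown fox", "the quick red fox"))
```

An author's contribution could then be estimated by summing this count over all of their non-reverted edits.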
More sophisticated diffs could also be introduced. For example, it would be
relatively simple to write a program that detects whether an author has
switched two (or more) paragraphs, and then applies a diff program as if they
had not been switched.
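One possible way to do that, sketched under the assumptions that paragraphs are separated by blank lines and that a moved paragraph is moved verbatim: paragraphs present in both revisions are set aside, and only the rest is diffed. The names paragraphs and changed_words_ignoring_moves are hypothetical, and difflib again stands in for a real diff program:

```python
import difflib
from collections import Counter

def paragraphs(text):
    """Split text into non-empty paragraphs (blank-line separated)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def changed_words_ignoring_moves(old, new):
    """Count changed words after discarding paragraphs that merely moved."""
    old_pars, new_pars = paragraphs(old), paragraphs(new)
    common = Counter(old_pars) & Counter(new_pars)  # paragraphs in both

    def leftover(pars):
        remaining = Counter(common)
        out = []
        for p in pars:
            if remaining[p] > 0:
                remaining[p] -= 1   # unchanged or moved: ignore it
            else:
                out.append(p)       # genuinely added, removed or edited
        return out

    a = " ".join(leftover(old_pars)).split()
    b = " ".join(leftover(new_pars)).split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in matcher.get_opcodes()
               if tag != "equal")

old = "one two three\n\nfour five six"
swapped = "four five six\n\none two three"
print(changed_words_ignoring_moves(old, swapped))
```

A paragraph that is moved and edited at the same time is not recognised as moved here; handling that case would need fuzzy matching between paragraphs.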
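Or totally disregard order, by splitting each revision into one word per line
and sorting before comparing (GNU sed; -E is needed so that the parentheses
and | act as alternation rather than literal characters):
cat article | sed -E -e 's/( |\t)/\n/g' | sort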
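The same order-free comparison can be sketched without the shell by treating each revision as a multiset of words. This is only an illustration of the idea; the function name bag_of_words_change is made up here:

```python
from collections import Counter

def bag_of_words_change(old, new):
    """Return (words added, words removed), ignoring word order entirely."""
    old_bag, new_bag = Counter(old.split()), Counter(new.split())
    added = sum((new_bag - old_bag).values())
    removed = sum((old_bag - new_bag).values())
    return added, removed

print(bag_of_words_change("a b c", "c b a"))
print(bag_of_words_change("a b c", "a b d d"))
```

The trade-off is the reverse of wdiff's: reordering never counts as a change, so an author who rewrites a sentence using the same words would go unnoticed.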
--
ℱin del ℳensaje.