On Tue, Oct 21, 2008 at 12:33 AM, Nikola Smolenski smolensk@eunet.yu wrote:
On Saturday 18 October 2008 14:57:59 Daniel Kinzler wrote:
So, what would it take? Where could we try it? What are the concerns?
FWIW, copying my email to M. Schneider:
IIRC, at Wikimania you talked about the problem of how to identify the primary authors of articles, so I wanted to share my thoughts on this.
The obvious first step is to go through all the revisions and compute the MD5 of each; then use the MD5s to isolate and disregard edits that have been reverted.
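A minimal sketch of that first step, in Python, assuming the revisions are already available as a list of texts ordered oldest first. The function name `surviving_revisions` and the policy of also dropping the reverting revision itself (it only restores earlier text) are assumptions made for illustration:

```python
import hashlib


def surviving_revisions(revisions):
    """Return the indices of revisions that were not undone by a later revert.

    A revision whose MD5 equals that of an earlier surviving revision is
    treated as a revert: every revision after the earlier one is discarded,
    and the revert itself is not counted as a contribution.
    """
    kept = []        # (index, digest) pairs for surviving revisions
    position = {}    # digest -> position of that revision inside `kept`
    for index, text in enumerate(revisions):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in position:
            # Revert detected: drop everything after the restored revision.
            cut = position[digest] + 1
            for _, stale in kept[cut:]:
                del position[stale]
            del kept[cut:]
        else:
            position[digest] = len(kept)
            kept.append((index, digest))
    return [index for index, _ in kept]
```

For example, with the history `["a", "a b", "vandal", "a b", "a b c"]`, the vandalism at index 2 and the revert at index 3 are both dropped, leaving indices 0, 1 and 4.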
To measure the difference between two edits, I mentioned to you that wdiff ( http://www.gnu.org/software/wdiff/ ) could be used: simply count the number of changed words in the article. Wdiff could give false positives (an author who merely switches two paragraphs will appear to be a major author), but could not give false negatives (an author who changes a single word really did change just a single word; of course, such a change may be very important, but it isn't major or, IMO, copyrightable).
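To illustrate the counting idea without depending on wdiff's output format, here is a rough Python sketch that uses the standard library's `difflib.SequenceMatcher` on word lists as a stand-in for wdiff. The function name `changed_words` and the choice to count both deleted and inserted words are assumptions, and the numbers it produces may differ from what wdiff itself reports:

```python
import difflib


def changed_words(old, new):
    """Count words deleted from `old` plus words inserted into `new`."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # i1:i2 spans removed words, j1:j2 spans added words.
            changed += (i2 - i1) + (j2 - j1)
    return changed
```

Replacing a single word counts as one deletion plus one insertion. Swapping two blocks of text yields a nonzero count even though no word was written or removed, which is the false positive described above.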
More sophisticated diffs could also be introduced. For example, it would be relatively simple to write a program that checks whether an author has switched two (or more) paragraphs, and then applies a diff program as if they had not been switched.
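One possible way to approximate this, sketched in Python under several assumptions: paragraphs are separated by blank lines, only paragraphs that were moved without any modification are recognised, and `difflib` again stands in for the diff program. The names `paragraphs` and `changed_words_ignoring_moves` are hypothetical:

```python
import difflib
from collections import Counter


def paragraphs(text):
    """Split text into non-empty paragraphs separated by blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def changed_words_ignoring_moves(old, new):
    """Count changed words after setting aside paragraphs present in both texts."""
    old_paras, new_paras = paragraphs(old), paragraphs(new)
    # Paragraphs found verbatim on both sides, wherever they appear.
    common = Counter(old_paras) & Counter(new_paras)

    def leftover_words(paras):
        budget = Counter(common)
        words = []
        for p in paras:
            if budget[p] > 0:
                budget[p] -= 1          # unchanged (possibly moved) paragraph
            else:
                words.extend(p.split())
        return words

    a, b = leftover_words(old_paras), leftover_words(new_paras)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return sum((i2 - i1) + (j2 - j1)
               for tag, i1, i2, j1, j2 in matcher.get_opcodes()
               if tag != "equal")
```

A paragraph that was both moved and edited is still treated as changed text here; handling that case would need fuzzy paragraph matching, which this sketch does not attempt.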
Or totally disregard order: tr -s '[:space:]' '\n' < article | sort
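Comparing the sorted word lists of two revisions amounts to comparing them as multisets of words. A small Python sketch of that comparison, with the hypothetical function name `bag_of_words_change`, assuming words are split on whitespace only:

```python
from collections import Counter


def bag_of_words_change(old, new):
    """Count words added plus words removed, ignoring word order entirely."""
    before, after = Counter(old.split()), Counter(new.split())
    added = after - before      # words more frequent in the new revision
    removed = before - after    # words more frequent in the old revision
    return sum(added.values()) + sum(removed.values())
```

With this measure, reordering paragraphs or sentences costs nothing, at the price of missing edits that only rearrange existing words.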