On 3/22/06, Ilmari Karonen <nospam@vyznev.net> wrote:
> One could ignore edits that have been reverted. Detecting reverts, in the strict sense of the word, is easy: all you need is a hash value for each revision.
> Of course, this wouldn't be perfect. But it'd be as close to perfect as any automated system can be. And it _would_ skip most vandals.
And what happens if the next edit merges some content back in from the reverted text? This happens; perhaps not frequently, but it isn't that rare either. You can't assume that an edit didn't count just because it was reverted.
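For what it's worth, the exact revert detection suggested above, one hash per revision, might look like this minimal sketch (names and details are mine; note it only catches byte-identical reverts, not the merge-back case just described):

    import hashlib

    def revision_hash(text):
        # Two byte-identical revisions get the same hash.
        return hashlib.sha1(text.encode("utf-8")).hexdigest()

    def find_reverted_edits(revisions):
        """Return the indices of revisions undone by a later exact revert.
        `revisions` is the list of revision texts in chronological order."""
        hashes = [revision_hash(t) for t in revisions]
        first_seen = {}
        reverted = set()
        for i, h in enumerate(hashes):
            if h in first_seen:
                # Everything between the original and this revert that
                # actually changed the text was undone.
                reverted.update(k for k in range(first_seen[h] + 1, i)
                                if hashes[k] != h)
            else:
                first_seen[h] = i
        return reverted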
I have a tool I call 'entropy flow', which I've been using for my "wikipedia text backend where all revisions fit in ram" project. I'll get around to releasing some stuff sooner or later, but it's easy enough to code yourself. It's a useful tool for finding out who the real contributors were...
It works like this. Take all n revisions of an article (a rough code sketch follows the list):

* Compress each revision with a whole-word-replacement, memoryless compression algorithm: common words are replaced with high-byte Huffman-coded keys from a precomputed dictionary (Unicode properly escaped). This gets around 2:1 compression on enwiki text and it's "diff invariant". It matters because long words alone aren't a sign of greater contribution, but the addition of more rare words probably is.
* Using xdelta3, compute the compressed binary diffs between all versions, roughly n^2 of them: for each revision, a diff against each of the other n-1 revisions plus the null article. Yes, this is a number of diffs easily in the millions.
* The diff sizes are the costs between the article nodes. Add a slight penalty to diffs that run in the reverse direction to break ties.
* Compute the minimum spanning tree over all these nodes. (I use the MST algorithm in the Parallel Boost Graph Library, http://www.osl.iu.edu/research/pbgl/ ... MST is polynomial time, but it's still expensive on millions of edges.)
* The users who have made edits to the nodes in the large subgraph containing the current revision are the contributors. All the little subgraphs hanging off the null-article node are mostly vandalism and nonsense.
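Here's a rough, self-contained Python sketch of that pipeline, under some loud assumptions: difflib diff sizes stand in for the word-compressed xdelta3 deltas, a naive O(n^2) Prim's MST replaces the Parallel BGL, the reverse-direction penalty is my reading of the tie-break rule, and all names are made up for illustration.

    import difflib
    from collections import defaultdict

    def diff_size(a, b):
        # Stand-in for an xdelta3 binary delta: the byte length of a plain
        # unified diff.  The real tool diffs the word-compressed text instead.
        return sum(len(line) + 1 for line in
                   difflib.unified_diff(a.splitlines(), b.splitlines(), lineterm=""))

    def entropy_flow_contributors(revisions, editors):
        """revisions[i] is the text of revision i (chronological order);
        editors[i] is the user who saved it.  Returns the users judged to
        be the real contributors."""
        n = len(revisions)
        texts = [""] + revisions        # node 0 is the null (empty) article
        NULL, CURRENT = 0, n

        PENALTY = 1                     # slight penalty on reverse-direction diffs
        def cost(i, j):
            fwd = diff_size(texts[i], texts[j])
            rev = diff_size(texts[j], texts[i])
            return min(fwd, rev + PENALTY)

        # Prim's algorithm over the complete graph on the n+1 nodes.
        in_tree = [False] * (n + 1)
        best = [(float("inf"), -1)] * (n + 1)   # (edge cost, parent) per node
        best[NULL] = (0, -1)
        parent = [-1] * (n + 1)
        for _ in range(n + 1):
            u = min((v for v in range(n + 1) if not in_tree[v]),
                    key=lambda v: best[v][0])
            in_tree[u] = True
            parent[u] = best[u][1]
            for v in range(n + 1):
                if not in_tree[v]:
                    c = cost(u, v)
                    if c < best[v][0]:
                        best[v] = (c, u)

        # Cut the tree at the null-article node and keep the subtree that
        # contains the current revision; the other subtrees hanging off the
        # null node are the vandalism/nonsense branches.
        children = defaultdict(list)
        for v in range(1, n + 1):
            children[parent[v]].append(v)
        root = CURRENT
        while parent[root] != NULL:
            root = parent[root]
        main, stack = set(), [root]
        while stack:
            v = stack.pop()
            main.add(v)
            stack.extend(children[v])
        return {editors[v - 1] for v in main}

The quadratic number of diffs is the expensive part, exactly as noted above; on a real history you'd want the real binary deltas and a real MST implementation, but the structure is the same.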
What seems to be a spookily correct ranking of contributors can be found by totaling the diff sizes of each user in the main subgraph and sorting by size.
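Assuming the sketch above is tweaked to also return `main` and `best`, that ranking is just a per-user sum of the MST edge sizes (again an illustration, not the real tool):

    from collections import Counter

    def rank_contributors(main, best, editors):
        # best[v][0] is the size of the MST edge attaching revision node v;
        # main is the set of nodes in the subtree containing the current revision.
        totals = Counter()
        for v in main:
            totals[editors[v - 1]] += best[v][0]
        return totals.most_common()   # [(user, total delta bytes), ...], biggest first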
Of course, no warranties that this actually tells you the contributors... the best way to be safe is just to list all editors. I created this approach for finding the optimal diff order for delta-compressing articles; this application is just a cute side effect. (When I'm actually compressing articles I just use a sliding window of 100-ish revisions, which is much cheaper to compute and pretty much just as good, although it sometimes misses when someone reverts to an ancient version, which is why, for finding contributors, it's better to consider the entire history in one MST pass.)
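For comparison, the sliding-window variant could look roughly like this, reusing diff_size from the earlier sketch; the greedy base selection and the exact window size are my guesses at the details:

    def window_delta_bases(revisions, window=100):
        """Pick, for each revision, the cheapest base among the previous
        `window` revisions (or the null article) to delta against."""
        choices = []
        for i, text in enumerate(revisions):
            best_base, best_size = -1, diff_size("", text)   # -1 = null article
            for j in range(max(0, i - window), i):
                size = diff_size(revisions[j], text)
                if size < best_size:
                    best_base, best_size = j, size
            choices.append((best_base, best_size))
        return choices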