On 25/01/2008, Shmuel Weidberg ezrawax@gmail.com wrote:
Thinking about what would be involved, it seems that it would be very difficult to sift through hundreds of edits to determine who wrote what. Anybody have any ideas about how to make it a manageable task?
It's not difficult, but because of vandalism and page blanking, in principle you need to go through the entire history of the article to find the earliest point at which a particular edit/paragraph/sentence/string of characters appeared, which is probably best checked using hashing. It would be time consuming: it's an O(N*M) problem per article, where N is the number of history items and M is the size of the final article.
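To make the full-history scan concrete, here is a rough sketch in Python. It assumes the history is available as a list of (author, text) pairs, oldest first, and that a line is the unit of attribution. Both are assumptions made for illustration, and the name first_authors is made up.

```python
import hashlib


def _key(line):
    # Hash the whitespace-trimmed line so lookups are cheap and fixed-size.
    return hashlib.sha1(line.strip().encode("utf-8")).hexdigest()


def first_authors(revisions):
    """Credit each line of the final revision to whoever first added it.

    `revisions` is a list of (author, text) pairs, oldest first
    (a hypothetical format chosen for this sketch).
    """
    first_seen = {}
    for author, text in revisions:
        for line in text.splitlines():
            if line.strip():
                # setdefault keeps the earliest author, so a later restore
                # after a blanking does not steal the credit.
                first_seen.setdefault(_key(line), author)
    final_text = revisions[-1][1]
    return [(line, first_seen[_key(line)])
            for line in final_text.splitlines() if line.strip()]


history = [
    ("alice", "Cats are mammals."),
    ("bob", "Cats are mammals.\nThey purr."),
    ("vandal", ""),
    ("carol", "Cats are mammals.\nThey purr.\nThey hunt mice."),
]
for line, author in first_authors(history):
    print(author, "wrote:", line)
```

Every revision is read in full, which is where the O(N*M) cost comes from.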
But you could do it a different way: go back through the history in large jumps (a binary search, if you know what that is) until a particular contribution disappears, and then sniff around that point to check that the disappearance wasn't just a temporary blanking. That would be O(log(N)*M) or better.
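A minimal sketch of the binary search idea, again in Python. The "sniffing around" is reduced to checking a few revisions just before each probe; the window size of 3 is an arbitrary assumption, and a blanking that lasts longer than the window would still fool it. The function names are made up.

```python
def introduced_by(texts, fragment, i, window):
    # Treat the fragment as "already introduced" at revision i if it appears
    # in revision i or in any of the `window` revisions just before it.
    # This is what papers over a short-lived blanking.
    start = max(0, i - window)
    return any(fragment in texts[j] for j in range(start, i + 1))


def earliest_revision_with(texts, fragment, window=3):
    """Binary search for the index of the first revision containing `fragment`.

    `texts` is the list of revision texts, oldest first (hypothetical format).
    """
    if fragment not in texts[-1]:
        raise ValueError("fragment is not in the final revision")
    lo, hi = 0, len(texts) - 1  # invariant: fragment is introduced by hi
    while lo < hi:
        mid = (lo + hi) // 2
        if introduced_by(texts, fragment, mid, window):
            hi = mid
        else:
            lo = mid + 1
    return lo


texts = ["A.", "A. B.", "A. B.", "", "A. B.", "A. B.", "A. B."]
print(earliest_revision_with(texts, "B."))
```

In this example the first probe lands on the blanked revision, and the window check is what stops the search from wrongly concluding that "B." only arrived afterwards.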
There would still be issues, though. If somebody corrects the spelling, a dumb program would think they wrote that bit entirely, whereas a human would probably still mostly credit the original author. There might be ways around that too, such as checking the percentage change.
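One possible way to do the percentage-change check is with difflib from the Python standard library. This is only a sketch: the 0.8 threshold is a guess that would need tuning, and the function name credit is made up.

```python
from difflib import SequenceMatcher


def credit(old_text, new_text, old_author, editor, threshold=0.8):
    # ratio() is 1.0 for identical strings and falls towards 0.0 as they
    # diverge. Above the (arbitrary) threshold the edit is treated as a
    # minor fix and the original author keeps the credit.
    similarity = SequenceMatcher(None, old_text, new_text).ratio()
    return old_author if similarity >= threshold else editor


# A spelling fix leaves the sentence mostly unchanged.
print(credit("Teh cat sat on teh mat.", "The cat sat on the mat.",
             "alice", "bob"))
# A rewrite shares very little with the original.
print(credit("Dogs bark loudly at night.", "The cat sat on the mat.",
             "alice", "bob"))
```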
It seems doable.
Regards,
Ezra