On 3/23/06, Ilmari Karonen <nospam@vyznev.net> wrote:
> And what happens if the next edit merges some content back in from the reverted text?
>
> This case falls under "not perfect but as close as can be". It's essentially the same problem as someone pasting content from another article, or from another source entirely. Even your diff-based scheme, while nifty indeed, doesn't solve that. In general, nothing can.
Well, actually, it does, because I proposed classifying only revisions which are completely disconnected from the main sub-graph as non-contributors. The revert+remerge will either end up on the entropy-flow shortest path (if the removed text is smaller than the preserved text), or as a little stub hanging off the main history-flow pathway, should the diff to the reverted version be smaller.
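To make that a bit more concrete, here is a rough, untested Python sketch of the idea. Zlib-compressed delta size stands in for the real delta cost, networkx does the graph work, and the window size is arbitrary, so treat it as an illustration rather than the actual code:

    import zlib
    import networkx as nx  # graph plus shortest-path bookkeeping

    def delta_cost(old, new):
        # Crude stand-in for a real delta: the extra compressed bytes
        # needed to encode `new` once `old` has already been seen.
        both = zlib.compress((old + "\x00" + new).encode("utf-8"))
        base = zlib.compress(old.encode("utf-8"))
        return max(0, len(both) - len(base))

    def entropy_flow_path(revisions, window=16):
        """revisions: one article's revision texts, oldest first.
        Returns the indices on the minimum-total-delta ("entropy flow")
        path from the first revision to the current one.  Reverted
        vandalism ends up as a stub hanging off this path, and revisions
        with no cheap connection to the main sub-graph at all are the
        candidate non-contributors."""
        g = nx.DiGraph()
        g.add_nodes_from(range(len(revisions)))
        for j in range(1, len(revisions)):
            # Only a window of earlier revisions is considered as a
            # possible parent, like the windowed deltas used for storage.
            for i in range(max(0, j - window), j):
                g.add_edge(i, j,
                           weight=delta_cost(revisions[i], revisions[j]))
        return nx.shortest_path(g, 0, len(revisions) - 1, weight="weight")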
> By the way, it might be possible to optimize your scheme by using some form of histogram analysis to quickly establish lower bounds on edit distances. For that matter, if you're not using it already, even just the difference in article lengths gives a weak lower bound on the edit distance. Meanwhile, hashing can be used to establish upper bounds, both by hashing the entire text to detect exact reversions and by hashing deterministically chosen chunks (such as article sections) to detect local changes.
Oh sure, I wasn't really going for history detection... I've been working on the optimal storage of all revisions. (BTW, the en.wikipedia revisions fit just fine in 6GB of RAM stored with windowed optimum deltas, and I'm expecting that I'll get it under 4GB eventually...) I only mentioned it because it's the only automated method I've found thus far which I believe managed to remove a non-trivial amount of pure vandalism without culling valid edits.
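The cheap bounds you're suggesting would be easy to bolt on, though. Roughly, in untested Python (the section split is just a crude stand-in, and the chunked upper bound only applies when the two revisions split into the same number of sections):

    import hashlib
    from collections import Counter

    def length_lower_bound(a, b):
        # Each insert or delete changes the length by one, so the length
        # difference is a weak lower bound on the edit distance.
        return abs(len(a) - len(b))

    def histogram_lower_bound(a, b):
        # Character histograms: one edit can fix at most one surplus
        # character on each side, so the larger surplus is a lower bound.
        # Always at least as tight as the length bound.
        diff = Counter(a)
        diff.subtract(Counter(b))
        surplus_a = sum(v for v in diff.values() if v > 0)
        surplus_b = -sum(v for v in diff.values() if v < 0)
        return max(surplus_a, surplus_b)

    def hash_upper_bound(a, b):
        # Whole-text hash catches exact reverts: distance 0 (modulo
        # hash collisions).
        if hashlib.sha1(a.encode()).digest() == hashlib.sha1(b.encode()).digest():
            return 0
        # Section hashing: identical sections cost nothing, differing
        # ones cost at most the length of the longer of the pair.
        sa, sb = a.split("\n== "), b.split("\n== ")
        if len(sa) != len(sb):
            return max(len(a), len(b))  # fall back to the trivial bound
        return sum(max(len(x), len(y)) for x, y in zip(sa, sb)
                   if hashlib.sha1(x.encode()).digest()
                      != hashlib.sha1(y.encode()).digest())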
> If you really want to have fun with tracing contributions, you end up in a rathole of computational linguistics trying to detect equivalent text. It's a fun game, but I don't believe the state of the art is advanced enough to produce anything but highly misleading results.
Somewhat off-topic, but another fun game to play is making English Wikipedia parsable by Link Grammar (http://www.link.cs.cmu.edu/link/). Link Grammar is good enough that text which is completely unparsable is almost always bad grammar that needs to be fixed. (I say almost because it is confused by the lack of a serial comma, which also messes up my meat-based parser, but I understand that's a matter which is widely debated.) Unfortunately LG only produces a fully sane parsing in about 80% of the cases where it thinks it has one... which makes it not so useful for my dream of a bot which suggests places to attach citation requests. :)
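If anyone wants to play along, the parsability check itself is tiny. Something like this, assuming a Python binding for link-grammar along these lines (the class and method names here are a guess, so adjust for whatever wrapper you actually have; the sentence splitter is deliberately crude):

    import re
    # Binding names are an assumption; adjust for whichever link-grammar
    # wrapper is actually installed.
    from linkgrammar import Sentence, ParseOptions, Dictionary

    def unparsable_sentences(text):
        """Yield sentences for which Link Grammar finds no complete
        linkage -- the ones that are almost always genuinely broken."""
        po = ParseOptions()
        en = Dictionary()  # English dictionary by default
        for sent in re.split(r'(?<=[.!?])\s+', text):
            sent = sent.strip()
            if not sent:
                continue
            linkages = Sentence(sent, en, po).parse()
            if next(iter(linkages), None) is None:  # no full parse found
                yield sent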