2009/1/8 Brion Vibber brion@wikimedia.org:
Definitely of interest! If you haven't already, I'd love to see some documentation on the format on mediawiki.org, and it'd be great if we
I did some similar work a while ago using Python's difflib[1] as the diffing engine. Since difflib was much too slow when fed lists of single characters, I broke the articles up into sequences of words, which improved the speed dramatically (though it's still not as fast as Robert's).
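To illustrate the word-level approach, here is a minimal sketch (not my actual code). The function name word_diff and the whitespace tokenization are assumptions made for the example; only difflib.SequenceMatcher and get_opcodes() come from the standard library.

```python
import difflib


def word_diff(old_text, new_text):
    """Return the non-equal opcodes between two texts, compared word by word.

    Hypothetical helper for illustration: splitting on whitespace is an
    assumption, and a real processor would probably keep the whitespace too.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    # autojunk=False turns off difflib's "popular element" heuristic, which
    # would otherwise ignore very common words in long articles.
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, old_words[i1:i2], new_words[j1:j2]))
    return changes


if __name__ == "__main__":
    for change in word_diff("the quick brown fox", "the slow brown fox"):
        print(change)
```

Comparing words instead of characters shrinks the sequences by roughly the average word length, which is where most of the speedup comes from.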
My goal was slightly different: rather than producing exact revision deltas, I was looking for "blame" information[2]. I also built a graph of reverts by matching the SHA1 hashes of revision texts, and used it to find the shortest path between the first revision and the current one, which skips over page-blanking events in most cases.
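The shortest-path idea can be sketched roughly as follows. This is a reconstruction under assumptions rather than the actual processor: it treats revisions with identical SHA1 digests as a single node, links each revision to its successor, and runs a breadth-first search. The name shortest_revision_path and the list-of-strings input are hypothetical.

```python
import hashlib
from collections import deque


def shortest_revision_path(revisions):
    """Return indices of revisions on a shortest path from first to last.

    Assumes `revisions` is a non-empty list of revision texts in
    chronological order. Revisions with the same SHA1 share one graph node,
    so a revert creates a shortcut around the edits it undid.
    """
    digests = [hashlib.sha1(text.encode("utf-8")).hexdigest() for text in revisions]

    # digest -> list of (next digest, index of the revision reached)
    edges = {}
    for i in range(len(digests) - 1):
        edges.setdefault(digests[i], []).append((digests[i + 1], i + 1))

    start, goal = digests[0], digests[-1]
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt, idx in edges.get(node, []):
            if nxt not in parent:
                parent[nxt] = (node, idx)
                queue.append(nxt)

    # Walk back from the goal to the start, collecting revision indices.
    path = []
    node = goal
    while parent[node] is not None:
        node, idx = parent[node][0], parent[node][1]
        path.append(idx)
    path.append(0)
    path.reverse()
    return path


if __name__ == "__main__":
    history = ["a", "a b", "", "a b", "a b c"]  # revision 2 blanks the page
    print(shortest_revision_path(history))
```

In the example history, the blanking at index 2 and its revert at index 3 drop out of the path, because the revert restores a text whose digest is already in the graph.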
The output for the first 1400 or so articles in enwiki can be found here: http://hewgill.com/~greg/wikiblame/
I would be interested in adapting my blame processor to use a faster diffing algorithm, since it took my machine many hours to process those 1400 articles.
[1]: http://python.org/doc/2.5/lib/module-difflib.html
[2]: http://hewgill.com/journal/entries/461-wikipedia-blame
Greg Hewgill http://hewgill.com