On Wed, Jan 7, 2009 at 6:25 PM, Keisial <keisial@gmail.com> wrote:
Anthony was doing some similar work in October, based on RCS and reverse deltas. His results also reduced the dump sizes notably.
http://blog.p2pedia.org/2008/10/anarchism.html
Using skip-deltas I think you could make a system fast enough to run live. At the very least it could be used as part of an incremental dump system. Using *smart* skip-deltas, you'd resolve the inefficiencies due to page-blanking vandalism.
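For concreteness, here's a minimal sketch of the base-selection rule I mean, along the lines of Subversion's skip-deltas: store each revision as a delta against the revision whose number is the current one with its lowest set bit cleared. (The revision numbering and function names are just for illustration.)

    def skip_delta_base(rev):
        # Skip-delta base: the revision number with its lowest set bit
        # cleared; revision 0 is stored as a full text.
        return rev & (rev - 1)

    def reconstruction_chain(rev):
        # Revisions to visit, newest first, when rebuilding `rev`.
        # Length is popcount(rev) + 1, i.e. O(log rev).
        chain = [rev]
        while rev > 0:
            rev = skip_delta_base(rev)
            chain.append(rev)
        return chain

    # e.g. reconstruction_chain(1000) touches 7 revisions, not 1000.

The *smart* variant would presumably just override that rule wherever the delta comes out larger than the text itself, which is what happens around blanking vandalism.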
I never managed to create an implementation, though. In particular, oddly enough, I couldn't find a standalone diff program that would produce deltas in RCS format, and I never got around to writing my own.
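For what it's worth, something along these lines (using Python's difflib; I haven't checked the output against what RCS actually expects, so treat the exact command syntax as an assumption) is roughly what I had in mind:

    import difflib

    def rcs_delta(old_lines, new_lines):
        # Emit RCS-style edit commands: "d<line> <count>" to delete,
        # "a<line> <count>" followed by the added lines to append.
        # Line numbers are 1-based and refer to the old text.
        out = []
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ('delete', 'replace'):
                out.append('d%d %d' % (i1 + 1, i2 - i1))
            if tag in ('insert', 'replace'):
                out.append('a%d %d' % (i2, j2 - j1))
                out.extend(line.rstrip('\n') for line in new_lines[j1:j2])
        return '\n'.join(out)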
One improvement over the diff format used by RCS would be smarter breakpoints, since wikitext tends to have a lot of really long lines with no line breaks. Some simple heuristics for guessing sentence breaks would probably help there. They wouldn't have to be perfect, since reconstruction just concatenates all the text back together regardless of where it was broken apart. Of course, the results would be somewhat language-specific.
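Something as crude as this would probably do (the regex is just a guess at what works for English wikitext); the only hard requirement is that the split be lossless, since the chunks get concatenated straight back together:

    import re

    def chunk_line(line):
        # Split a long wikitext line at guessed sentence boundaries.
        # Lossless by construction: ''.join(chunk_line(s)) == s, so the
        # guesses only need to be deterministic, not correct.
        breaks = [m.end() for m in re.finditer(r'[.!?](?=\s+[A-Z\[])', line)]
        starts = [0] + breaks
        ends = breaks + [len(line)]
        return [line[a:b] for a, b in zip(starts, ends)]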
I like Robert's approach, though. I'd love to see the results of applying the skip-delta concept to it, to achieve O(lg N) random-access reads (with N being the number of revisions of the article being accessed). Or maybe I'm misreading him and he's already accomplished that?