On Wed, Jan 7, 2009 at 6:25 PM, Keisial <keisial(a)gmail.com> wrote:
Anthony was doing some similar work in October, based on RCS and reverse
deltas. His results also reduced the sizes notably.
http://blog.p2pedia.org/2008/10/anarchism.html
Using skip-deltas I think you could make a system fast enough to run live.
At the very least it could be used as part of an incremental dump system.
Using *smart* skip-deltas, you'd resolve the inefficiencies due to
page-blanking vandalism.
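To sketch the skip-delta idea (this follows the Subversion scheme, where each
revision's delta base is found by clearing the lowest set bit of the revision
number; the function names here are mine, just for illustration):

```python
def skip_delta_base(rev):
    """Skip-delta base: clear the lowest set bit of rev.
    Revision 0 is stored as full text, not a delta."""
    return rev & (rev - 1)

def delta_chain(rev):
    """Revisions whose deltas must be composed to reconstruct rev.
    The chain length is popcount(rev) + 1, i.e. O(lg N)."""
    chain = []
    while rev:
        chain.append(rev)
        rev = skip_delta_base(rev)
    chain.append(0)
    return chain[::-1]
```

Reconstructing revision 7 walks 0 -> 4 -> 6 -> 7; even at a million revisions
the chain stays around twenty deltas.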
I never managed to create an implementation, though. In particular, oddly
enough, I couldn't find a standalone diff program that produces deltas in RCS
format, and I never got around to writing my own.
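For reference, the RCS edit-script format is just a sequence of "dL N" / "aL N"
commands keyed to the old revision's line numbers, so generating one from
Python's difflib is fairly direct. A rough sketch (function names are mine, and
this hasn't been validated against real RCS tooling):

```python
import difflib

def rcs_delta(old, new):
    """Build an RCS-style edit script ('dL N' deletes N lines at old line L,
    'aL N' appends N following lines after old line L)."""
    script = []
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ('delete', 'replace'):
            script.append('d%d %d' % (i1 + 1, i2 - i1))
        if tag in ('insert', 'replace'):
            script.append('a%d %d' % (i2, j2 - j1))
            script.extend(new[j1:j2])
    return script

def apply_rcs(old, script):
    """Replay an edit script; `offset` maps old line numbers onto the
    partially edited text."""
    text, offset, k = list(old), 0, 0
    while k < len(script):
        op, args = script[k][0], script[k][1:]
        line, count = map(int, args.split())
        k += 1
        if op == 'd':
            start = line - 1 + offset
            del text[start:start + count]
            offset -= count
        else:  # 'a': insert the next `count` script lines after `line`
            at = line + offset
            text[at:at] = script[k:k + count]
            offset += count
            k += count
    return text
```

A page-blanking revision then deltas down to a single "d1 N" line, which is
part of why blankings are cheap to store but expensive for delta chains.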
One improvement over the diff format used by RCS would be to use smarter
breakpoints, since wikitext tends to have a lot of really long lines with no
line breaks. Using some simple heuristics to guess at sentence breaks would
probably be useful there. It wouldn't have to be perfect, since
reconstruction just concatenates all the text together regardless of where
it was broken apart. Of course, the results would be somewhat
language-specific.
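Something like the following would do (the regex is my guess at a "good
enough" breakpoint heuristic, nothing RCS-specific), since join-on-read makes
bad guesses harmless:

```python
import re

def soft_breaks(text):
    """Break long wikitext at guessed sentence boundaries so a line-based
    differ sees short records. Reconstruction is a plain ''.join(), so a
    wrong guess only costs delta efficiency, never correctness."""
    # Heuristic: break after '.', '!' or '?' when whitespace follows.
    return [c for c in re.split(r'(?<=[.!?])(?=\s)', text) if c]
```

Anything join-invariant works here; the differ just gets more, shorter records
to match against.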
I like Robert's approach, though. I'd love to see the results of applying
the concept of skip-deltas to it, to achieve O(lg(N)) random access reads
(with N being the number of revisions to the article being accessed). Or
maybe I'm misreading him and he's already accomplished that?