On Thu, Jan 8, 2009 at 7:29 AM, Nikola Smolenski <smolensk(a)eunet.yu> wrote:
> Anthony wrote:
>> Using skip-deltas I think you could make a system fast enough to run
>> live. At the very least it could be used as part of an incremental
>> dump system. Using *smart* skip-deltas, you'd resolve the
>> inefficiencies due to page-blanking vandalism.
>
> One more possibility is to make md5 of every revision, then diff only
> between those that have unique md5s.
My thought was that each stored revision would have two pieces of
data: the diff and the rev id of the prior version on which it was
based. By default the prior revision would be chosen using the
standard algorithm that guarantees lg(N) chains (see
http://svn.collab.net/repos/svn/trunk/notes/skip-deltas), but when the
new text exactly matches a prior revision you'd set the prior version
to that exact match and store an empty diff (for performance you'd
want some checks to handle repeated reverts while keeping the chains
as short as possible). This would probably be much easier with
FSFS-style deltas (à la Subversion), but you could keep one (or a few)
of the most recent revisions in full to handle the normal case
efficiently.
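
To make that concrete, here's a rough Python sketch of the base
selection I have in mind. Revision numbers are assumed to count from
zero within a page's history, and md5_index (a digest -> revision
lookup) and the function names are hypothetical, not anything in
MediaWiki's schema:

    import hashlib

    def skip_delta_base(n):
        # Standard skip-delta rule from the Subversion notes: delta
        # revision n against the revision reached by clearing the
        # lowest set bit of n, e.g. 5 (101b) -> 4 (100b) -> 0. Chains
        # stay at most lg(N) hops long; revision 0 (its own base) is
        # stored in full.
        return n & (n - 1)

    def choose_base(n, text, md5_index):
        # Return (base_revision, is_exact_match) for new revision n.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in md5_index:
            # Revert detected: base on the identical revision and
            # store an empty diff, so a revert costs one empty delta
            # no matter how large the page is.
            return md5_index[digest], True
        return skip_delta_base(n), False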
MD5s could be kept to make the search for exact duplicates fast, but
the advantage of this scheme over plain MD5 matching is that it could
be extended to cover near-exact matches (e.g. an optimizer could run
in the background finding near-duplicates and rearranging the chains
appropriately). It would also handle the case of an MD5 match on
non-matching text without special-casing. That case definitely needs
handling: an accidental MD5 collision is vanishingly unlikely, but
MD5 is cryptographically broken, so deliberately constructed
collisions are a real concern.
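
The fix for that is cheap: treat the MD5 index purely as a hint and
confirm candidates byte-for-byte before reusing them. A sketch, with
md5_index and load_revision_text as assumed helpers:

    import hashlib

    def find_exact_duplicate(text, md5_index, load_revision_text):
        # Look up a prior revision with the same MD5, then verify the
        # full text. The verification is what makes a collision --
        # accidental or deliberately crafted -- harmless: a colliding
        # but different revision just fails the comparison and we fall
        # back to a normal delta.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        candidate = md5_index.get(digest)
        if candidate is not None and load_revision_text(candidate) == text:
            return candidate  # true duplicate: empty diff against it
        return None  # no match (or a collision): delta normally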
>> One improvement over the diff format used by RCS would be to use
>> smarter breakpoints, since wikitext tends to have a lot of really
>> long lines with no line breaks. Using some simple heuristics to
>> guess at sentence breaks would probably be useful there. It
>> wouldn't have to be perfect, since
>
> I suggest looking into wdiff ( http://www.gnu.org/software/wdiff/ ).
I'll have to think about whether wdiff would work with delta
combining, which is part of the key to ensuring fast lookup of prior
revisions.
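
On the breakpoint idea itself, the heuristic really can be crude,
because both sides of a diff are chunked the same way; consistency
matters more than linguistic accuracy. A throwaway sketch in Python
(the regex is illustrative only):

    import re

    def sentence_chunks(wikitext):
        # Split wikitext into diff units at newlines and at likely
        # sentence ends (., !, or ? followed by a space). Zero-width
        # lookbehind splits keep the delimiters, so "".join(chunks)
        # reproduces the input exactly -- a requirement if the deltas
        # must reconstruct revisions losslessly.
        return re.split(r"(?<=\n)|(?<=[.!?] )", wikitext)

    text = "One long wikitext line. It has no newlines! Just sentences."
    chunks = sentence_chunks(text)
    assert "".join(chunks) == text  # chunking is lossless

Feeding chunks like these to an ordinary longest-common-subsequence
diff would give roughly wdiff-like granularity while still producing
deltas that can be composed.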
Of course, this is probably all just a pipe dream for me: barring the
discovery of tools that do exactly what I need, I have neither the
time nor the money to see this get implemented.
Subversion does most of this, and tweaking Subversion to be more
wiki-aware would be another possibility, but Subversion is incredibly
bloated for this purpose. Plus I'd probably have to rewrite MediaWiki.
MediaWiki doesn't allow easily pluggable "alternative databases", does
it? (I seem to recall that database access is not very well
encapsulated in MW.)