On Thu, Jan 8, 2009 at 7:29 AM, Nikola Smolenski smolensk@eunet.yu wrote:
Anthony wrote:
Using skip-deltas I think you could make a system fast enough to run live.
At the very least it could be used as part of an incremental dump system. Using *smart* skip-deltas, you'd resolve the inefficiencies due to page-blanking vandalism.
One more possibility is to compute an MD5 of every revision, then diff only between those that have unique MD5s.
My thought was that each stored revision would have two pieces of data: the diff and the rev id of the prior version on which it was based. By default the prior revision would be chosen using the standard algorithm to ensure lg(N) chains (see http://svn.collab.net/repos/svn/trunk/notes/skip-deltas), but in the case of an exact match with a prior revision you'd set the prior version to that exact match and leave the diff blank (for performance you'd want to throw in some checks to handle multiple reverts while keeping the chains as short as possible). This would probably be much easier with FSFS deltas (a la Subversion), but you could keep one (or a few) of the most recent revisions in full to handle the normal case efficiently.
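Roughly, the chain structure I have in mind looks like this (just a Python sketch to illustrate; revision 0 stored in full, every later revision as a (base rev, diff) pair in some store, and apply_diff standing in for whatever patch routine goes with the diff format are all my assumptions here):

    def skip_delta_base(rev):
        # Standard skip-delta base selection (see the Subversion
        # notes/skip-deltas link above): clear the lowest set bit of
        # the revision number, which keeps every chain at most
        # lg(N) links long.
        return rev & (rev - 1)

    def reconstruct(rev, store, apply_diff):
        # Walk the (base_rev, diff) pairs back to revision 0, which
        # is kept as full text, then replay the diffs forward.  A
        # revert stored as (matching_rev, '') just follows the link
        # and applies an empty diff.
        chain = []
        while rev != 0:
            base, diff = store[rev]
            chain.append(diff)
            rev = base
        text = store[0]
        for diff in reversed(chain):
            text = apply_diff(text, diff)
        return text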
MD5s could be kept to make the search for exact duplicates fast, but the advantage of this over just using MD5s is that it could be extended to cover near-exact matches (e.g. an optimizer could run in the background finding near-duplicates and rearranging the chains appropriately). It would also handle the case of an MD5 match on non-matching text without special-casing, and that case definitely needs to be handled: while an MD5 match is unlikely in random data, you also have to worry about intentional attacks, since MD5 is cryptographically insecure.
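The duplicate lookup would be something like this (again only a sketch; md5_index mapping a digest to the list of revision ids with that digest is an assumed structure, and get_full_text would be the reconstruction above):

    import hashlib

    def find_exact_duplicate(text, md5_index, get_full_text):
        # The MD5 only narrows the search; every candidate is
        # verified against the full text, so an engineered collision
        # can never cause two different revisions to be treated as
        # identical.
        digest = hashlib.md5(text.encode('utf-8')).hexdigest()
        for rev in md5_index.get(digest, []):
            if get_full_text(rev) == text:
                return rev
        return None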
One improvement over the diff format used by RCS would be to use smarter
breakpoints, since wikitext tends to have a lot of really long lines with
no line breaks. Using some simple heuristics to guess at sentence breaks
would probably be useful there. It wouldn't have to be perfect, since
I suggest looking into wdiff ( http://www.gnu.org/software/wdiff/ ).
I'll have to think about whether or not wdiff would work with delta combining, which is part of the key to ensuring fast lookup of prior revisions.
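For the breakpoint idea quoted above, I'm picturing something as simple as also splitting after sentence-ending punctuation before handing the text to an ordinary line-based diff, e.g. (illustrative only, not any existing tool):

    import re

    def breakpoints(wikitext):
        # Break at newlines as usual, but also after ". ", "! " and
        # "? ", so a long wikitext paragraph isn't one giant diff
        # unit.  The split points are zero-width, so joining the
        # pieces reproduces the original text exactly; a wrong guess
        # only makes a delta a little bigger, never incorrect.
        pieces = []
        for line in wikitext.splitlines(keepends=True):
            pieces.extend(re.split(r'(?<=[.!?] )', line))
        return pieces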
Of course, this is probably all just a pipe dream for me - barring the discovery of tools that do exactly what I need, I don't have the time and/or the money to watch this get implemented.
Subversion does most of this, and tweaking Subversion to be more wiki-aware would be another possibility, but Subversion is incredibly bloated for this purpose. Plus I'd probably have to rewrite MediaWiki. MediaWiki doesn't allow easily pluggable "alternative databases", does it? (I seem to recall the database access not being very well encapsulated in MW.)