On Thu, Jan 8, 2009 at 7:29 AM, Nikola Smolenski smolensk@eunet.yu wrote:
Anthony wrote:
Using skip-deltas I think you could make a system fast enough to run live.
At the very least it could be used as part of an incremental dump system. Using *smart* skip-deltas, you'd resolve the inefficiencies due to page-blanking vandalism.
One more possibility is to compute an MD5 of every revision, then diff only between those that have unique MD5s.
My thought was that each stored revision would have two pieces of data: the diff and the rev id of the prior version on which it was based. By default the prior revision would be chosen using the standard algorithm to ensure lg(N) chains (see http://svn.collab.net/repos/svn/trunk/notes/skip-deltas), but in the case of an exact match with a prior revision you'd set the prior version to that exact match and leave the diff blank (for performance you'd want to throw in some checks to handle multiple reverts while keeping the chains as short as possible). This would probably be much easier with FSFS deltas (a la Subversion), but you could keep one (or a few) of the most recent revisions in full to handle the normal case efficiently.
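Roughly, the chain structure I have in mind looks like this (just a Python sketch to illustrate; revision 0 stored in full, every later revision as a (base rev, diff) pair in some store, and apply_diff standing in for whatever patch routine goes with the diff format are all my assumptions here):

    def skip_delta_base(rev):
        # Standard skip-delta base selection (see the Subversion
        # notes/skip-deltas link above): clear the lowest set bit of
        # the revision number, which keeps every chain at most
        # lg(N) links long.
        return rev & (rev - 1)

    def reconstruct(rev, store, apply_diff):
        # Walk the (base_rev, diff) pairs back to revision 0, which
        # is kept as full text, then replay the diffs forward.  A
        # revert stored as (matching_rev, '') just follows the link
        # and applies an empty diff.
        chain = []
        while rev != 0:
            base, diff = store[rev]
            chain.append(diff)
            rev = base
        text = store[0]
        for diff in reversed(chain):
            text = apply_diff(text, diff)
        return text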
MD5s could be kept to make the search for exact duplicates fast, but the advantage of this over just using MD5s is that it could be extended to cover near-exact matches (e.g. an optimizer could run in the background finding near-duplicates and rearranging the chains appropriately). It would also handle the case of an MD5 match on non-matching text without special-casing, and that case definitely needs to be handled: while an MD5 match is unlikely in random data, you also have to worry about intentional attacks, since MD5 is cryptographically insecure.
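The duplicate lookup would be something like this (again only a sketch; md5_index mapping a digest to the list of revision ids with that digest is an assumed structure, and get_full_text would be the reconstruction above):

    import hashlib

    def find_exact_duplicate(text, md5_index, get_full_text):
        # The MD5 only narrows the search; every candidate is
        # verified against the full text, so an engineered collision
        # can never cause two different revisions to be treated as
        # identical.
        digest = hashlib.md5(text.encode('utf-8')).hexdigest()
        for rev in md5_index.get(digest, []):
            if get_full_text(rev) == text:
                return rev
        return None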
One improvement over the diff format used by RCS would be to use smarter
breakpoints, since wikitext tends to have a lot of really long lines with
no line breaks. Using some simple heuristics to guess at sentence breaks
would probably be useful there. It wouldn't have to be perfect, since
I suggest looking into wdiff ( http://www.gnu.org/software/wdiff/ ).
I'll have to think about whether or not wdiff would work with delta combining, which is part of the key to ensuring fast lookup of prior revisions.
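For the breakpoint idea quoted above, I'm picturing something as simple as also splitting after sentence-ending punctuation before handing the text to an ordinary line-based diff, e.g. (illustrative only, not any existing tool):

    import re

    def breakpoints(wikitext):
        # Break at newlines as usual, but also after ". ", "! " and
        # "? ", so a long wikitext paragraph isn't one giant diff
        # unit.  The split points are zero-width, so joining the
        # pieces reproduces the original text exactly; a wrong guess
        # only makes a delta a little bigger, never incorrect.
        pieces = []
        for line in wikitext.splitlines(keepends=True):
            pieces.extend(re.split(r'(?<=[.!?] )', line))
        return pieces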
Of course, this is probably all just a pipe dream for me - barring the discovery of tools that do exactly what I need, I don't have the time and/or the money to watch this get implemented.
Subversion does most of this, and tweaking Subversion to be more wiki-aware would be another possibility, but Subversion is incredibly bloated for this purpose. Plus I'd probably have to rewrite MediaWiki. MediaWiki doesn't allow easily pluggable "alternative databases", does it? (I seem to recall the database access not being very well encapsulated in MW.)