Netocrat wrote:
> On Mon, 12 Sep 2005 01:56:53 -0700, Brion Vibber wrote:
> That's exactly the sort of info I was looking for. Was any attempt made
> to compress the diffs? I would be interested to know how the result
> compared for compression and overall speed to the compressed
> concatenated revisions.
No, no work has been done along these lines that I'm aware of.
> The three main reasons to find an improvement to rcs diffs were stated
> as:
>
> * moved paragraphs
> * reverted edits
> * minor changes within a line
>
> The 1st and 3rd could be handled by a customised diff format, and the
> 2nd could be handled by links in the database. Have those possibilities
> been considered, and what are the pros and cons of this approach vs the
> current compression scheme?
No, these possibilities have not been rigorously examined. Note that
those aren't really reasons; they're illustrative only. Addressing each
of them in turn does not guarantee that your compression algorithm is
effective. I was just describing my train of thought in arriving at the
idea that LZ77 might be worth a try. If you have your own idea, please
download a dump and try it out.
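To make the LZ77 idea concrete, here is a minimal sketch of the kind of
experiment meant here, using zlib's LZ77-based deflate as the compressor.
The revision texts are made up; a real test would use texts from a dump:

```python
import zlib

# Made-up page history: each revision is a small edit of the previous one.
base = " ".join("word%d" % i for i in range(400))
revisions = [
    base,
    base + " an appended sentence.",
    "a moved sentence. " + base,
]

# Compress each revision on its own.
individual = sum(len(zlib.compress(r.encode())) for r in revisions)

# Compress the revisions concatenated: LZ77 back-references can then
# exploit the redundancy between adjacent revisions.
concatenated = len(zlib.compress("".join(revisions).encode()))

# concatenated comes out much smaller than individual, because each
# later revision is mostly a back-reference to the one before it.
print(individual, concatenated)
```

The same harness could be pointed at a real dump to compare compressed
diffs against compressed concatenation for both size and speed.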
The main thing which put me off implementing diff-based compression was
the complexity, in particular the required schema change. If you need to
load some large number of diffs in order to generate a revision, those
diffs need to be loaded in a single database query if any kind of
efficiency is to be achieved.
In other words, don't do a proof of principle and then nag me to write
the real thing, as if that were the easy part.
Since that talk, we've addressed the scalability issue by implementing
external storage, allowing us to store text on the terabytes of apache
hard drive space which were previously unused. Because of this, we're
less concerned about size now, and more about performance and
manageability. We'd like to have faster backups and much simpler
administration. Effective use of the existing compression and external
storage features has been hampered by high system administration
overhead. Any new storage proposal needs to be evaluated in this context.
> The disadvantage of the current compression scheme seems to me to be
> that the wiki software must work on the full text of a set of revisions
> at a time (i.e. uncompressed).
The advantage is that when a number of adjacent revisions are required
(such as during a backup), those revisions can be loaded quickly with a
minimum of seeking.
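The trade-off can be sketched in a few lines, assuming a made-up
delimiter and group size (the real storage format differs; this only
illustrates the access pattern):

```python
import zlib

# Toy version of concatenated storage: a group of adjacent revisions
# compressed as one blob.
SEP = "\x00"
revisions = ["rev %d text" % i for i in range(20)]
blob = zlib.compress(SEP.join(revisions).encode())

def get_revision(blob, i):
    # Reading ONE revision still decompresses the whole group...
    return zlib.decompress(blob).decode().split(SEP)[i]

def get_all(blob):
    # ...but a backup gets every revision in the group from a single
    # sequential read and one decompression pass.
    return zlib.decompress(blob).decode().split(SEP)

print(get_revision(blob, 5))  # "rev 5 text"
```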
-- Tim Starling