On Mon, 12 Sep 2005 23:40:55 +1000, Tim Starling wrote:
[...]
> No, these possibilities have not been rigorously examined. Note that
> those aren't really reasons; they're illustrative only, and addressing
> each of them in turn does not guarantee that your compression algorithm
> is effective. I was just describing my train of thought in arriving at
> the idea that LZ77 might be worth a try. If you have your own idea,
> please download a dump and try it out.
I have ideas; I may try that at some point. My instinct is that a diff
scheme which accounted for intra-line changes as well as block text moves
would yield the most compression, and that if it alone didn't, then
compressing the resulting diff stream the way the entire revision set is
currently compressed would. The question I haven't delved into is how
difficult that would be to implement, and how computation-intensive it
would be compared to what is currently done.
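
To make the comparison concrete, here is a minimal sketch in Python (my
choice of language for illustration; MediaWiki itself is PHP) of the two
strategies: compressing the concatenated revisions as a whole versus
compressing a stream of per-revision diffs. difflib's unified_diff is
line-based, so it doesn't capture the intra-line changes or block moves
I'm describing; a custom differ would replace it.

    import difflib
    import zlib

    def diff_stream(revisions):
        # Store the first revision whole, then line-based unified
        # diffs between each pair of consecutive revisions.
        parts = [revisions[0]]
        for old, new in zip(revisions, revisions[1:]):
            delta = difflib.unified_diff(old.splitlines(keepends=True),
                                         new.splitlines(keepends=True))
            parts.append(''.join(delta))
        return '\x00'.join(parts)   # crude record separator

    def compare(revisions):
        whole = '\x00'.join(revisions).encode('utf-8')
        diffs = diff_stream(revisions).encode('utf-8')
        print('concatenated revisions, zlib:', len(zlib.compress(whole, 9)))
        print('diff stream, zlib:', len(zlib.compress(diffs, 9)))

On a typical edit history, where most revisions differ by only a few
lines, the diff stream should compress considerably further; measuring
that on a real dump is exactly the experiment you're suggesting.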
> The main thing which put me off implementing diff-based compression was
> the complexity, in particular the required schema change. If you need to
> load some large number of diffs in order to generate a revision, those
> diffs need to be loaded in a single database query, if any kind of
> efficiency is to be reached.
Yes, I realise that.
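Something like the following single-query pattern is what I'd have in
mind: pull the base text and every subsequent diff for the page in one
round trip, then replay them in order. The rev_text table and its
columns are hypothetical, not MediaWiki's actual schema, and the
ndiff-style delta is a stand-in for whatever compact diff format were
chosen.

    import difflib
    import sqlite3

    def load_revision(conn, page_id, rev_id):
        # One query fetches the stored full text plus every diff up
        # to and including the requested revision.  (A real query
        # would start from the latest full-text checkpoint instead
        # of the beginning of the page's history.)
        rows = conn.execute(
            "SELECT is_full, body FROM rev_text"
            " WHERE page_id = ? AND rev_id <= ?"
            " ORDER BY rev_id", (page_id, rev_id)).fetchall()
        text = None
        for is_full, body in rows:
            if is_full:
                text = body
            else:
                # Stand-in patch step: ndiff deltas can be replayed
                # without the base text; a compact diff format would
                # be applied against `text` here instead.
                text = ''.join(difflib.restore(
                    body.splitlines(keepends=True), 2))
        return text

    # Usage (hypothetical database file):
    # conn = sqlite3.connect('revisions.db')
    # print(load_revision(conn, page_id=1, rev_id=42))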
> In other words, don't do a proof of principle and then nag me to write
> the real thing, as if that were the easy part.
To start with, I want to get an idea of what's currently done and why,
and of which ideas have previously been proposed and/or rejected, and on
what grounds. I understand that the main wikis are huge and that
performance issues are important.
--
http://members.dodo.com.au/~netocrat