Modified repost: the first message apparently reached the moderator, as I
received an email, but it never appeared on gmane, nor did I receive a
rejection email.
On Mon, 12 Sep 2005 22:51:28 +0200, Tomasz Wegrzanowski wrote:
> On Tue, Sep 13, 2005 at 12:24:27AM +1000, Netocrat wrote:
>> I have ideas, I may do that at some point. My instinct is that a diff
>> scheme that took account of intra-line changes as well as block text
>> moves would result in the most compression; and that if that alone
>> didn't, then compressing that diff scheme in the way that the entire
>> revision set currently is, would. The question I haven't delved into
>> is: how difficult would it be to do that and how computation-intensive
>> compared to what is currently done?
> Maybe download a dump and try with xdelta ?
Looks promising. xdelta stores an MD5 sum in the differential, but xdelta3
allows this to be omitted, which makes a big difference for very small
changes. There seems to be a bit of other extraneous material included
(e.g. the reconstructed filename) that could perhaps be avoided by using
the API instead of the command line. Other than that it does more or less
what I described, although it produces binary output and it trades some
optimality for speed.
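For the record, the sort of invocation I've been testing looks roughly
like this (an untested sketch driving the xdelta3 command line from
Python; -n is the flag that omits the checksum - check the help output
for your version, as flag details may vary):

    import subprocess

    def make_delta(old_path, new_path, patch_path):
        # Encode: produce a delta transforming old_path into new_path.
        # -e = encode, -s = source file, -n = omit the checksum, which
        # is what makes the difference for very small changes.
        subprocess.check_call(
            ["xdelta3", "-e", "-n", "-s", old_path, new_path, patch_path])

    def apply_delta(old_path, patch_path, new_path):
        # Decode: reconstruct the new revision from the old one plus
        # the delta.  -n must match the encode-time setting.
        subprocess.check_call(
            ["xdelta3", "-d", "-n", "-s", old_path, patch_path, new_path])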
A preliminary test on *very* limited data shows that it may be a fair
improvement over gzip for space.
xdelta3 supports internally compressing the diff with a method such as
gzip, which again yields improvements but does not seem to be worth the
(significant) extra time required; if compression is to be applied, it
should be applied to a group of diffs rather than to each individually.
Testing shows that this group approach offers a huge improvement in
compression and performance over the one-at-a-time approach, and, as I
expected, it also greatly reduces final space requirements compared to
the current gzip-a-set-of-revisions approach.
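To make the comparison concrete, this is roughly what I mean by the two
approaches (a sketch only; the diffs are assumed to have been produced
already):

    import zlib

    def compress_individually(diffs):
        # One deflate stream per diff: each small diff pays its own
        # stream overhead and shares no redundancy with its neighbours.
        return [zlib.compress(d) for d in diffs]

    def compress_as_group(diffs):
        # One deflate stream over the concatenation: redundancy between
        # diffs is exploited and the overhead is paid once.  The lengths
        # are kept so the blob can be split apart again after inflation.
        lengths = [len(d) for d in diffs]
        return lengths, zlib.compress(b"".join(diffs))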
In terms of speed, for a single revision it slightly outperforms gzip at
compression. From what I understand (and I may have this wrong), under
the current approach gzip must decompress an entire set of revisions, add
the latest revision to that set, and recompress it, whereas xdelta3 need
compress only the latest revision; so this would probably yield an
overall improvement.
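In other words, the write paths compare something like this (a sketch;
the null-byte separator and the helper names are placeholders of my own,
not anything from the actual code):

    import zlib

    def append_revision_gzip(compressed_set, new_text):
        # Current scheme as I understand it: inflate the whole set,
        # append the new revision, deflate everything again.
        revisions = zlib.decompress(compressed_set)
        return zlib.compress(revisions + b"\0" + new_text)

    def append_revision_xdelta(previous_text, new_text, make_delta):
        # Delta scheme: compute one delta against the previous
        # revision; nothing already stored needs to be touched.
        return make_delta(previous_text, new_text)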
For decompression speed, xdelta3 is again faster in the case of a single
revision. Here, though, the fact that gzip yields a whole set of
revisions at a time goes in its favour. On average gzip would be faster,
since xdelta3 must perform multiple sequential operations to reach an
arbitrary revision, whereas gzip requires only one.
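To spell out why retrieval is sequential under the delta scheme
(apply_one here is assumed to apply a single delta in memory):

    def fetch_revision(base_text, deltas, n, apply_one):
        # Revision n requires applying every delta from the base up to
        # n in order: n operations, versus gzip's single decompression.
        text = base_text
        for delta in deltas[:n]:
            text = apply_one(text, delta)
        return text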
The other related issue that Tim Starling raised - that of accessing all
the data in one query - remains open. A stored procedure seems
appropriate; I know the MySQL team have proposed them, but I don't know
whether they've yet materialised. The other option is as I described
above: gzip a set of xdelta3 diffs. This is the most space-efficient
method but also the most computation-intensive (aside from the
compress-each-individual-diff approach).
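Failing a stored procedure, a single query could at least pull the whole
chain in one round trip, with the deltas applied client-side - something
like this, against a hypothetical schema of my own invention (rev_blob
holding either a base text or a delta per row):

    def fetch_chain(cursor, page_id, rev_id):
        # Hypothetical table, not the actual MediaWiki schema: one row
        # per stored blob, flagged as base text or delta, in order.
        cursor.execute(
            "SELECT is_base, data FROM rev_blob"
            " WHERE page_id = %s AND rev_id <= %s"
            " ORDER BY rev_id",
            (page_id, rev_id))
        return cursor.fetchall()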
So overall there's potential, but I need to work on some representative
data to present a proper summary of the pros and cons. Which presents a
David and Goliath problem... 56k dial-up vs a 31G XML download. Can
anyone suggest a source for a smaller data set in English with some
representative multiple-revision articles, preferably including a few
edit wars etc.?
--
http://members.dodo.com.au/~netocrat