Modified repost: the first message apparently reached the moderator, since I received an email, but it never appeared on gmane, nor did I receive a rejection email.
On Mon, 12 Sep 2005 22:51:28 +0200, Tomasz Wegrzanowski wrote:
On Tue, Sep 13, 2005 at 12:24:27AM +1000, Netocrat wrote:
I have ideas; I may do that at some point. My instinct is that a diff scheme that took account of intra-line changes as well as block text moves would give the most compression, and that if that alone didn't, then compressing those diffs in the way that the entire revision set is currently compressed would. The question I haven't delved into is: how difficult would that be, and how computation-intensive compared to what is currently done?
Maybe download a dump and try with xdelta ?
Looks promising. xdelta stores an md5 sum in the differential, but xdelta3 allows this to be omitted, which makes a big difference for very small changes. There seems to be a bit of other extraneous material included (e.g. the reconstructed filename) that could perhaps be avoided by using the API instead of the command line. Other than that it does more or less what I described, although it produces binary output and it trades some optimality for speed.
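For concreteness, here's roughly how I'm driving it, sketched in Python around the xdelta3 command line. The file names are made up, and I've left out the checksum-suppression option since its exact spelling may vary between versions, so treat this as illustrative only:

import subprocess

def encode_delta(prev_path, curr_path, delta_path):
    # "xdelta3 -e -s SOURCE TARGET DELTA" writes a binary delta of TARGET
    # against SOURCE; -f overwrites any existing output file.
    subprocess.run(["xdelta3", "-e", "-f", "-s", prev_path, curr_path, delta_path],
                   check=True)

def decode_delta(prev_path, delta_path, out_path):
    # The reverse: rebuild a revision from its predecessor plus the delta.
    subprocess.run(["xdelta3", "-d", "-f", "-s", prev_path, delta_path, out_path],
                   check=True)

encode_delta("rev_0041.txt", "rev_0042.txt", "rev_0042.xd3")
decode_delta("rev_0041.txt", "rev_0042.xd3", "rev_0042.rebuilt.txt")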
A preliminary test on *very* limited data shows that it may be a fair improvement over gzip for space.
xdelta3 supports internally compressing the diff with a method such as gzip, which again yields improvements but does not seem to be worth the (significant) extra time required - if this were to be done it should be done on a group of diffs rather than individually. Testing shows that this grouped approach offers a huge improvement in compression and performance over the one-at-a-time approach, and, as I expected, it also greatly reduces final space requirements compared to the current gzip-a-set-of-revisions approach.
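To show what I mean by the group approach, a rough Python sketch, again with made-up file names - the length-prefix framing is just one way of keeping the diffs separable inside the gzip stream:

import gzip

def pack_delta_group(delta_paths, group_path):
    # Store each delta as a 4-byte length prefix followed by its bytes, then
    # gzip the concatenation so similar diffs share one compression context.
    with gzip.open(group_path, "wb") as out:
        for path in delta_paths:
            with open(path, "rb") as f:
                data = f.read()
            out.write(len(data).to_bytes(4, "big"))
            out.write(data)

def unpack_delta_group(group_path):
    # Inflate the group once and split it back into individual deltas.
    deltas = []
    with gzip.open(group_path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            deltas.append(f.read(int.from_bytes(header, "big")))
    return deltas

pack_delta_group(["rev_0042.xd3", "rev_0043.xd3", "rev_0044.xd3"], "group_001.gz")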
In terms of speed, for a single revision it slightly outperforms gzip at compression. Given that, from what I understand (and I may have this wrong), the current approach requires gzip to decompress an entire set of revisions, add the latest revision to that set and recompress it, whereas xdelta3 need compress only the latest revision, this would probably yield an overall improvement.
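To spell out the asymmetry I'm relying on - and again, my picture of the current write path may be wrong, and the null-byte separator below is pure invention rather than the real storage format:

import gzip
import subprocess

def append_revision_current(set_path, new_text):
    # My understanding of the current approach: inflate the whole revision
    # set, append the new revision, then deflate everything again.
    with gzip.open(set_path, "rb") as f:
        existing = f.read()
    with gzip.open(set_path, "wb") as out:
        out.write(existing + b"\x00" + new_text)

def append_revision_xdelta(prev_path, curr_path, delta_path):
    # Delta approach: only the newest revision is touched, encoded against
    # its immediate predecessor.
    subprocess.run(["xdelta3", "-e", "-f", "-s", prev_path, curr_path, delta_path],
                   check=True)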
For decompression speed, xdelta3 is again faster in the case of a single revision. Here, though, the fact that gzip works on a whole set of revisions at a time goes in its favour: on average gzip would be faster, since xdelta3 must perform multiple sequential operations to get at an arbitrary revision whereas gzip requires only one.
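This is the read side where gzip wins back ground: with chained deltas, getting at revision N means replaying every delta from the base text forward. Roughly (paths and the naming scheme are invented):

import shutil
import subprocess

def reconstruct(base_path, delta_paths, out_path):
    # Start from the base text and apply each delta in turn; every pass
    # rebuilds the next revision from the previous one.
    shutil.copyfile(base_path, out_path)
    for delta in delta_paths:
        subprocess.run(["xdelta3", "-d", "-f", "-s", out_path, delta,
                        out_path + ".next"], check=True)
        shutil.move(out_path + ".next", out_path)

# Revision 3 of an article needs all three deltas applied in order:
reconstruct("rev_base.txt", ["rev_0001.xd3", "rev_0002.xd3", "rev_0003.xd3"],
            "rev_0003.rebuilt.txt")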
The other related issue that Tim Starling raised - that of accessing all the data in one query - remains open. A stored procedure seems appropriate - I know the MySQL team have proposed them, but I don't know whether they've yet materialised. The other option is as I described above: gzip a set of xdelta3 diffs. This is the most space-efficient method but also the most computation-intensive (aside from the compress-each-individual-diff approach).
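If the gzip-a-group-of-diffs route were taken, the single-query part at least looks straightforward - something along these lines, where the table and column names are pure invention and not the real schema:

import MySQLdb

def fetch_delta_group(article_id, group_no):
    # One SELECT pulls back the whole gzipped group of xdelta3 diffs;
    # decompression and delta application then happen in the application.
    db = MySQLdb.connect(db="wikidb")
    cur = db.cursor()
    cur.execute("SELECT group_blob FROM revision_groups"
                " WHERE article_id = %s AND group_no = %s",
                (article_id, group_no))
    (blob,) = cur.fetchone()
    db.close()
    return blob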
So overall there's potential, but I need to work on some representative data to present a proper summary of the pros and cons. Which presents a David and Goliath problem... 56k dial-up vs a 31G XML download. Can anyone suggest a source for a smaller data set in English with some representative multiple-revision articles, preferably including a few edit wars etc.?