Marco Schuster wrote:
On Wed, Jan 7, 2009 at 10:43 PM, Robert Rohde <rarohde@gmail.com> wrote:
So, I'd like to know whether there are people (besides myself) who are interested in seeing the full history dumps expressed in an edit syntax rather than the full-text syntax currently used.
AFAIR, Tim or Brion is working on a diff-based DB compressor, whose output has to be decompressed again for dumping, so I think your approach, in addition to diff compression for storage, is a really cool idea. Do you have the source anywhere so people can look at it?
Yes, I just finished compressing the external storage clusters 13 and 14 from 628 GB down to 30 GB. It should help improve dump speed by reducing the disk read rate, but the format isn't suitable for interchange. See the DiffHistoryBlob class in HistoryBlob.php.
It would be possible to uncompress from my format and then recompress into another diff format for the dump output, as long as the operation is properly parallelized. I'm not sure how much value there is in trying to preserve the diff itself from storage to output.
The main difference between my code and the usual solutions is that I reorder the revisions to preserve a good compression ratio when a page is regularly hit by page-blanking vandalism. Otherwise, every time the page is blanked and replaced with a small message, you have to store the entire restored revision again, because a diff from the short message back to the full text contains almost the whole page. That was the dominant use of space in certain test cases.
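To illustrate why the ordering matters, here is a minimal sketch in Python. It is not the DiffHistoryBlob code: the helper names (`delta_cost`, `chain_cost`, `reorder`), the 0.5 length ratio, and the use of `difflib` as the diff engine are all assumptions made for the example. It counts how many characters have to be stored literally when each revision is diffed against the one stored before it, first in chronological order and then with the short "blanked" revisions moved to the end.

```python
import difflib


def delta_cost(base: str, target: str) -> int:
    """Characters that must be stored literally to rebuild target from base."""
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    cost = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            cost += j2 - j1
    return cost


def chain_cost(revisions) -> int:
    """Total literal characters when each revision is diffed against the previous one."""
    total = 0
    base = ""
    for text in revisions:
        total += delta_cost(base, text)
        base = text
    return total


def reorder(revisions, ratio=0.5):
    """Return a storage order (list of indices) with short revisions deferred.

    Hypothetical heuristic: a revision much shorter than the last revision
    kept in the head of the chain is treated as a blanking and moved to the tail.
    """
    head, tail = [], []
    for index, text in enumerate(revisions):
        if head and len(text) < ratio * len(revisions[head[-1]]):
            tail.append(index)
        else:
            head.append(index)
    return head + tail


# Made-up history: a long article that is blanked twice and restored with small edits.
article = "The quick brown fox jumps over the lazy dog. " * 20
history = [
    article,
    "BLANKED",
    article + "More text.",
    "BLANKED AGAIN",
    article + "More text. Even more.",
]

order = reorder(history)
chronological = chain_cost(history)
reordered = chain_cost([history[i] for i in order])
print(order, chronological, reordered)
```

In chronological order each restoration after a blanking costs nearly a full copy of the article, while in the reordered chain the long revisions are diffed against each other and only the small edits are stored. The returned index list has to be kept alongside the diffs so the revisions can be put back in their original order when reading.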
-- Tim Starling