Marco Schuster wrote:
> On Wed, Jan 7, 2009 at 10:43 PM, Robert Rohde
> <rarohde(a)gmail.com> wrote:
>> So, I'd like to know whether there are people (besides myself) who
>> are interested in seeing the full history dumps expressed in an edit
>> syntax rather than the full-text syntax currently used.
> AFAIR, Tim or Brion are working on a DB compressor with diffs, which
> has to be de-compressed for dumping again, so I think your approach in
> addition to the diff compression for storage is a really cool idea.
> Do you have the source anywhere so people can look at it?
Yes, I just finished compressing the external storage clusters 13 and 14
from 628 GB down to 30 GB. It should help improve dump speed by reducing
the disk read rate, but the format isn't suitable for interchange. See the
DiffHistoryBlob class in HistoryBlob.php.
It would be possible to uncompress from my format and then recompress into
another diff format for the dump output, as long as the operation is
properly parallelized. I'm not sure how much value there is in trying to
preserve the diff itself from storage to output.
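
For illustration only (this is not code from the thread or from HistoryBlob.php): a minimal Python sketch of re-expressing already-uncompressed revisions as diffs, one page per task, so the work can be spread over workers. The names page_to_diffs and recompress, the use of unified diffs as the output format, and the in-memory input are all assumptions. A thread pool keeps the sketch short; for CPU-bound diffing, a process pool or separate machines would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import unified_diff


def page_to_diffs(page):
    """Turn one page's full-text revisions into a list of unified diffs.

    `page` is a hypothetical (title, [revision_text, ...]) tuple; each
    revision is diffed against the revision before it.
    """
    title, revisions = page
    diffs, prev = [], []
    for text in revisions:
        lines = text.splitlines(keepends=True)
        diffs.append("".join(unified_diff(prev, lines, "prev", "next")))
        prev = lines
    return title, diffs


def recompress(pages, workers=4):
    """Pages are independent of each other, so map them over a pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(page_to_diffs, pages))


pages = [
    ("A", ["hello\n", "hello\nworld\n"]),
    ("B", ["x\n"]),
]
result = recompress(pages)
```

Each page's diff chain depends only on that page's own revisions, which is what makes a per-page split workable.
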
The main difference between my code and the usual solutions is that I
reorder the revisions in order to preserve a good compression ratio in the
scenario of regular page-blanking vandalism. Otherwise, every time the
page is blanked and replaced with a small message, you have to store the
entire revision again, which was the dominant use of space in certain test
cases.
-- Tim Starling
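
To make the page-blanking point concrete, here is a rough Python sketch. It is not the DiffHistoryBlob code: the cost measure, the split_by_size heuristic and its 0.5 ratio are assumptions made up for this example. It compares a diff chain in chronological order with one where the very short (blanked) revisions are moved to the end of the chain, keeping the original indices so the chronological order can be restored.

```python
from difflib import SequenceMatcher


def delta_cost(old, new):
    """Characters of `new` that are not copied from `old` (a crude cost)."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return sum(j2 - j1 for tag, i1, i2, j1, j2 in matcher.get_opcodes()
               if tag in ("insert", "replace"))


def chain_cost(texts):
    """Cost of storing each text as a delta against the one before it."""
    cost, prev = 0, ""
    for text in texts:
        cost += delta_cost(prev, text)
        prev = text
    return cost


def split_by_size(revisions, ratio=0.5):
    """Hypothetical heuristic: revisions much shorter than the latest
    long revision go into a separate `small` list."""
    main, small = [], []
    for index, text in enumerate(revisions):
        if main and len(text) < ratio * len(main[-1][1]):
            small.append((index, text))
        else:
            main.append((index, text))
    return main, small


article = "".join(f"Line {i} of the article.\n" for i in range(40))
blank = "Page blanked."
v2 = article + "One more sentence.\n"
v3 = v2 + "And another.\n"
revisions = [article, blank, v2, blank, v3]

naive_cost = chain_cost(revisions)

main, small = split_by_size(revisions)
reordered = [text for _, text in main + small]
reordered_cost = chain_cost(reordered)

print("chronological order:", naive_cost)
print("blanked revisions moved to the end:", reordered_cost)
```

In chronological order, every revision that follows a blanking has almost nothing to copy from, so nearly the whole article is stored again. After reordering, the long revisions diff against each other and only the small edits are stored.
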