On Wed, Jan 7, 2009 at 5:22 PM, Tim Starling tstarling@wikimedia.org wrote:
Marco Schuster wrote:
On Wed, Jan 7, 2009 at 10:43 PM, Robert Rohde rarohde@gmail.com wrote:
So, I'd like to know whether there are people (besides myself) who are interested in seeing the full history dumps expressed in an edit syntax rather than the full-text syntax currently used.
AFAIR, Tim or Brion is working on a DB compressor with diffs, whose output has to be decompressed again for dumping, so I think your approach, in addition to the diff compression for storage, is a really cool idea. Do you have the source anywhere so people can look at it?
Yes, I just finished compressing the external storage clusters 13 and 14 from 628 GB down to 30 GB. It should help improve dump speed by reducing the disk read rate, but the format isn't suitable for interchange. See the DiffHistoryBlob class in HistoryBlob.php.
It would be possible to uncompress from my format and then recompress into another diff format for the dump output, as long as the operation is properly parallelized. I'm not sure how much value there is in trying to preserve the diff itself from storage to output.
The main difference between my code and the usual solutions is that I reorder the revisions in order to preserve a good compression ratio in the scenario of regular page-blanking vandalism. Otherwise, every time the page is blanked and replaced with a small message, you have to store the entire revision again, which was the dominant use of space in certain test cases.
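To illustrate why reordering helps, here is a toy Python sketch. It is not the DiffHistoryBlob code: the names `delta_cost`, `chain_cost`, and `reorder`, the `small_factor` threshold, and the cost measure (characters that cannot be copied from the preceding revision) are all assumptions made for illustration only.

```python
import difflib


def delta_cost(old: str, new: str) -> int:
    """Rough delta size: characters of `new` on lines that cannot be copied from `old`."""
    old_lines = old.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    cost = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            cost += sum(len(line) for line in new_lines[j1:j2])
    return cost


def chain_cost(texts) -> int:
    """Total cost of storing each text as a delta against the one stored before it."""
    total = 0
    prev = ""
    for text in texts:
        total += delta_cost(prev, text)
        prev = text
    return total


def reorder(revisions, small_factor=0.5):
    """Defer revisions much smaller than the last kept one to the end of the chain.

    Returns (original_index, text) pairs so chronological order can be restored.
    The threshold `small_factor` is a hypothetical tuning knob.
    """
    main, deferred = [], []
    for idx, text in enumerate(revisions):
        if main and len(text) < small_factor * len(main[-1][1]):
            deferred.append((idx, text))
        else:
            main.append((idx, text))
    return main + deferred


article = "".join(f"Line {i} of the article.\n" for i in range(50))
history = [article, "Blanked.\n", article, "Blanked.\n", article + "new line\n"]

chronological = chain_cost(history)
reordered = chain_cost([text for _idx, text in reorder(history)])
print(chronological, reordered)
```

In storage order, the full article is paid for again after every blanking. In the reordered chain the blanked revisions sit together at the end, so the article text is only paid for once.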
My approach to that case was to hash each revision and keep a record of the hashes already seen for the article. If the same hash appeared more than once, my diff syntax instructed it to "<revert>" to the previous revision. My solution doesn't address the case where someone reverts page-blanking vandalism and edits the text at the same time, but that case is quite rare in practice.
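As a minimal Python sketch of the hash-and-revert idea described above: the function names, the tuple-based operation format, and the choice of SHA-1 are assumptions for illustration, not the actual syntax or code. A real encoder would emit a diff where this sketch emits a "store".

```python
import hashlib


def encode_history(revisions):
    """Replace any revision whose text was seen before with a revert marker."""
    seen = {}  # hex digest -> index of the first revision with that text
    ops = []
    for idx, text in enumerate(revisions):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            # Same hash as an earlier revision: point back instead of storing it.
            ops.append(("revert", seen[digest]))
        else:
            seen[digest] = idx
            ops.append(("store", text))
    return ops


def decode_history(ops):
    """Rebuild the full list of revision texts from the operations."""
    texts = []
    for kind, payload in ops:
        texts.append(texts[payload] if kind == "revert" else payload)
    return texts


revisions = ["Full article text", "", "Full article text", "Edited text", ""]
ops = encode_history(revisions)
print(ops)
```

A cautious implementation would also compare the texts when the hashes match, so that a hash collision cannot silently produce a wrong revert.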
For a bit more background, my current implementation breaks the article into lines (splitting on newline characters) and intelligently handles the following kinds of changes:

* line insertion
* line deletion
* line replacement (major edits to a line)
* text replacement (small edits within a line)
* line reordering (handling the case of sections being reordered improves significantly upon diff generators that ignore this case)
* article truncation
* article replacement (or so many edits that it is more compact to simply specify the new version)
* article reversion to prior version
* appending to article
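A rough Python sketch of a line-based edit script, built on the standard library's `difflib`, shows the general shape of such a format. It is not the implementation described above: it covers only line insertion, deletion, replacement, appending, and the fall-back to a full replacement, and it does not handle line reordering or small edits within a line. The function names, the tuple format, and the per-edit `OVERHEAD` figure are assumptions.

```python
import difflib

OVERHEAD = 8  # assumed bookkeeping cost per edit, in characters


def make_edits(old: str, new: str):
    """Describe `new` as line-level edits against `old`."""
    a = old.splitlines(keepends=True)
    b = new.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is "insert", "delete", or "replace"; lines is empty for a delete.
            edits.append((tag, i1, i2, b[j1:j2]))
    payload = sum(len(line) for _t, _i1, _i2, lines in edits for line in lines)
    if edits and payload + OVERHEAD * len(edits) >= len(new):
        # So many edits that storing the new version outright is more compact.
        return [("full", new)]
    return edits


def apply_edits(old: str, edits) -> str:
    """Rebuild the new text from the old text and the edit list."""
    if edits and edits[0][0] == "full":
        return edits[0][1]
    a = old.splitlines(keepends=True)
    out = []
    pos = 0
    for _tag, i1, i2, lines in edits:
        out.extend(a[pos:i1])  # copy unchanged lines before this edit
        out.extend(lines)      # emit inserted or replacement lines
        pos = i2               # skip deleted or replaced lines
    out.extend(a[pos:])
    return "".join(out)


old = "".join(f"line {i}\n" for i in range(20))
new = old.replace("line 5\n", "line five\n") + "tail\n"
edits = make_edits(old, new)
print(edits)
```

Here appending shows up as an insert at the end of the old text, and truncation as a delete that runs to its last line.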
-Robert Rohde