On Wed, Jan 7, 2009 at 5:22 PM, Tim Starling <tstarling(a)wikimedia.org> wrote:
Marco Schuster wrote:
On Wed, Jan 7, 2009 at 10:43 PM, Robert Rohde <rarohde(a)gmail.com> wrote:
So, I'd like to know whether there are people (besides myself) who are
interested in seeing the full history dumps expressed in an edit syntax
rather than the full-text syntax currently used.
AFAIR, Tim or Brion is working on a DB compressor with diffs, whose
output has to be decompressed again for dumping, so I think your approach,
in addition to the diff compression for storage, is a really cool idea.
Do you have the source anywhere so people can look at it?
Yes, I just finished compressing the external storage clusters 13 and 14
from 628 GB down to 30 GB. It should help improve dump speed by reducing
the disk read rate, but the format isn't suitable for interchange. See the
DiffHistoryBlob class in HistoryBlob.php.
It would be possible to uncompress from my format and then recompress into
another diff format for the dump output, as long as the operation is
properly parallelized. I'm not sure how much value there is in trying to
preserve the diff itself from storage to output.
The main difference between my code and the usual solutions is that I
reorder the revisions in order to preserve a good compression ratio in the
scenario of regular page-blanking vandalism. Otherwise, every time the
page is blanked and replaced with a small message, you have to store the
entire revision again, which was the dominant use of space in certain test
cases.
My approach to that case was to hash each revision and keep a record
of the hashes already seen for the article. If the same hash
appeared more than once, my diff syntax instructed it to "<revert>" to
the previous revision. My solution doesn't address the case where
someone reverts page-blanking vandalism and edits the text at the same
time, but that case is quite rare in practice.
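
To make the hashing idea concrete, here is a minimal sketch in Python. It
is not my actual code: the function name encode_history, the tuple-based
output, and the choice of SHA-1 are all assumptions made for illustration.
It only shows the bookkeeping of remembering hashes per article and
emitting a revert marker instead of the full text.

```python
import hashlib


def encode_history(revisions):
    """Encode a list of revision texts for one article.

    Returns a list of ("full", text) or ("revert", index) tuples, where
    index points at the earliest revision with identical text.
    (Hypothetical output format, for illustration only.)
    """
    seen = {}  # hex digest -> index of the first revision with that text
    encoded = []
    for index, text in enumerate(revisions):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            # Same text as an earlier revision: store a pointer, not the text.
            encoded.append(("revert", seen[digest]))
        else:
            seen[digest] = index
            encoded.append(("full", text))
    return encoded
```

With the history ["article text", "blanked", "article text"], the third
entry comes out as ("revert", 0) rather than a second copy of the article.
A revert combined with an edit produces a new hash, so it falls through
to the "full" branch, which is the rare case mentioned above.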
For a bit more background, my current implementation breaks the
article into lines (splitting on newline characters), and intelligently
handles the following kinds of changes:
line insertion
line deletion
line replacement (major edits to a line)
text replacement (small edits within a line)
line reordering (handling the case of sections being reordered
improves significantly upon diff generators that ignore this case)
article truncation
article replacement (or so many edits that it is more compact to
simply specify the new version)
article reversion to prior version
appending to article
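
As a rough illustration of the line-based part only, the sketch below uses
Python's standard difflib module rather than my own diff generator. The
names line_ops and apply_ops and the tuple format are hypothetical. It
covers line insertion, deletion, and replacement; it does not attempt
line reordering, within-line text replacement, or the article-level cases.

```python
import difflib


def line_ops(old, new):
    """Return the non-equal edit operations that turn old into new.

    Each operation is (tag, i1, i2, lines): replace old lines [i1:i2]
    with the given new lines. tag is "insert", "delete", or "replace".
    """
    a = old.split("\n")
    b = new.split("\n")
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, i1, i2, b[j1:j2]))
    return ops


def apply_ops(old, ops):
    """Rebuild the new text from the old text and the operations."""
    a = old.split("\n")
    out = []
    pos = 0
    for _tag, i1, i2, lines in ops:
        out.extend(a[pos:i1])  # copy unchanged lines before this operation
        out.extend(lines)      # emit the inserted or replacement lines
        pos = i2               # skip the old lines that were removed
    out.extend(a[pos:])        # copy the unchanged tail
    return "\n".join(out)
```

Only the changed lines are stored, so a small edit to a long article
costs a few lines rather than the whole text, and apply_ops(old,
line_ops(old, new)) gives back new.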
-Robert Rohde