As I assume most people here know, each revision in the full history dumps for MediaWiki reports the complete page text. So even though an edit may have changed only a few characters, the entire page is repeated for each revision. This is one of the reasons that full history dumps are very large.
Recently I've written some Python code to re-express the revision history in an "edit syntax", using an XML-compatible notation for changes with elements like:
<replace>, <delete>, <insert>, etc.
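To give a flavor of the idea, here is a simplified sketch of how one revision could be re-expressed this way using Python's difflib. It is not my actual code; the tag names and the pos/len attributes here are just illustrative, and the real implementation handles details like escaping and granularity differently.

    import difflib
    from xml.sax.saxutils import escape

    def revision_to_edits(old_text, new_text):
        """Describe the change from old_text to new_text as edit operations."""
        edits = []
        matcher = difflib.SequenceMatcher(None, old_text, new_text)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == 'equal':
                continue  # unchanged text is not repeated in the output
            if tag == 'replace':
                edits.append('<replace pos="%d" len="%d">%s</replace>'
                             % (i1, i2 - i1, escape(new_text[j1:j2])))
            elif tag == 'delete':
                edits.append('<delete pos="%d" len="%d"/>' % (i1, i2 - i1))
            elif tag == 'insert':
                edits.append('<insert pos="%d">%s</insert>'
                             % (i1, escape(new_text[j1:j2])))
        return edits

    old = "Saturn is the sixth planet from the Sun."
    new = "Saturn is the sixth planet from the Sun and the second largest."
    print('\n'.join(revision_to_edits(old, new)))

Only the changed spans get written out, which is where the size savings come from.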
Since many revisions really consist of only small changes to the text, the notation I've been developing can greatly reduce the size of the dump while still maintaining a human-readable syntax. For example, I recently ran it against the full history dump of ruwiki (179 GB uncompressed, 1.2 M pages, 11.2 M revisions) and got a 94% reduction in size (to 11.1 GB). Because it is still a text-based format, it stacks well with traditional file compressors (bz2 reduces it a further 89%, to 1.24 GB; 7z a further 91%, to 1.07 GB).
It could also serve as a precursor to analyses designed to work out "primary" authors, and to other tasks where one wants to know who is making large edits and who is making small, housekeeping edits.
Obviously, as a compressor it is most successful with large pages that have a large number of relatively minor revisions. For example, the enwiki history of [[Saturn]] (current size 57 kB, 4741 revisions) sees a 99.1% size reduction. I suspect that the size reduction on large wikis, like en or de, would be even greater than the 94% for ruwiki, since larger wikis tend to have larger pages and more revisions per page.
The current version of my compressor averaged a little better than 250 revisions per second on ruwiki (about 12 hours total) on an 18-month-old desktop. However, since CPU utilization was only 50-70% of a full processing core most of the time, I suspect that my choice to read and write from an external hard drive was the limiting factor. On a good machine, 400+ rev/s might be plausible for the current code. In short, the overhead of computing my edit syntax is relatively small compared to the generation time for the current dumps (which I'm guessing is limited by communication with the text data store).
My code has some quirks and known bugs, and I'd describe it as a late-stage alpha at the moment. It still needs considerable work (not to mention documentation) before I would consider it ready for general use.
However, I wanted to know whether this is a project of interest to MediaWiki developers or others. Placed in the dump chain, it could substantially reduce the size of the human-readable dumps (at the cost that one would need to apply a series of edits to recover the full text of any specific revision). Used for other purposes, it could help distinguish major from minor editors, etc. If this project is mostly just a curiosity for my own use, then I will probably keep the code pretty crude. However, if other people are interested in using something like this, then I am willing to put more effort into making it cleaner and more generally usable.
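For anyone wondering what applying a series of edits would look like on the consumer side, it is essentially just folding the stored operations back into the text. A simplified sketch (again with illustrative operation tuples rather than my real parser) would be:

    def apply_edits(text, edits):
        """Apply one revision's parsed (op, pos, length, payload) operations to text."""
        # Apply from the highest position downward so earlier offsets stay valid.
        for op, pos, length, payload in sorted(edits, key=lambda e: e[1], reverse=True):
            if op == 'replace':
                text = text[:pos] + payload + text[pos + length:]
            elif op == 'delete':
                text = text[:pos] + text[pos + length:]
            elif op == 'insert':
                text = text[:pos] + payload + text[pos:]
        return text

    def text_of_revision(base_text, revision_edits, n):
        """Rebuild revision n by folding the first n edit lists into the base text."""
        text = base_text
        for edits in revision_edits[:n]:
            text = apply_edits(text, edits)
        return text

    # e.g. inserting " (planet)" after "Saturn":
    print(apply_edits("Saturn is bright.", [('insert', 6, 0, ' (planet)')]))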
So, I'd like to know whether there are people (besides myself) who are interested in seeing the full history dumps expressed in an edit syntax rather than the full-text syntax currently used.
-Robert Rohde