As I assume most people here know, each revision in the full history
dumps for MediaWiki reports the complete page text. So even though an
edit may have changed only a few characters, the entire page is
reported for each revision. This is one of the reasons that full
history dumps are very large.
Recently I've written some Python code to re-express the revision
history in an "edit syntax", using an XML-compatible notation for
changes with elements like:
<replace>, <delete>, <insert>, etc.
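
To make the idea concrete, here is a minimal sketch of how one revision could be expressed as edits against the previous one. It is not the code described in this post: it uses Python's standard difflib as a stand-in diff engine, and the attribute names (start, end, at) are hypothetical, since the actual notation is not spelled out here. Character-level matching like this is also slow on long pages, so a real tool would likely diff by line or word first.

```python
import difflib
from xml.sax.saxutils import escape


def diff_to_edit_xml(old: str, new: str) -> str:
    """Express the change from `old` to `new` as XML-style edit elements.

    Offsets refer to character positions in `old`. The element and
    attribute layout is an assumption made for illustration only.
    """
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged text is never stored
        if tag == "delete":
            parts.append(f'<delete start="{i1}" end="{i2}"/>')
        elif tag == "insert":
            parts.append(f'<insert at="{i1}">{escape(new[j1:j2])}</insert>')
        else:  # tag == "replace"
            parts.append(
                f'<replace start="{i1}" end="{i2}">{escape(new[j1:j2])}</replace>'
            )
    return "\n".join(parts)


if __name__ == "__main__":
    before = "Saturn is the sixth planet."
    after = "Saturn is the sixth planet from the Sun."
    print(diff_to_edit_xml(before, after))
```

Only the changed span is written out, which is where the size savings for small edits would come from.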
Since many revisions really only consist of small changes to the text,
using the notation I've been developing can greatly reduce the size of
the dump, while still maintaining a human-readable syntax. For
example, I recently ran it against the full history dump of ruwiki
(179 GB uncompressed, 1.2 M pages, 11.2 M revisions), and got a 94%
reduction in size (to 11.1 GB). Because it is still a text-based format,
it stacks well with traditional file compressors (bz2: a further 89%
reduction, to 1.24 GB; 7z: a further 91% reduction, to 1.07 GB).
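
As a quick check on how these stages combine, the short calculation below works only from the rounded sizes quoted above, so its results can differ from the quoted percentages by about a point. The variable names are just for illustration.

```python
# Sizes in GB, taken from the ruwiki figures quoted above.
FULL_TEXT_GB = 179.0    # full-history dump, uncompressed
EDIT_SYNTAX_GB = 11.1   # same history in the edit syntax
BZ2_GB = 1.24           # edit syntax after bz2
SEVENZ_GB = 1.07        # edit syntax after 7z


def reduction(before: float, after: float) -> float:
    """Percentage size reduction going from `before` to `after`."""
    return 100.0 * (1.0 - after / before)


if __name__ == "__main__":
    print(f"edit syntax vs. full text: {reduction(FULL_TEXT_GB, EDIT_SYNTAX_GB):.1f}%")
    print(f"bz2 vs. edit syntax:       {reduction(EDIT_SYNTAX_GB, BZ2_GB):.1f}%")
    print(f"7z vs. edit syntax:        {reduction(EDIT_SYNTAX_GB, SEVENZ_GB):.1f}%")
    print(f"bz2 vs. full text:         {reduction(FULL_TEXT_GB, BZ2_GB):.1f}%")
    print(f"7z vs. full text:          {reduction(FULL_TEXT_GB, SEVENZ_GB):.1f}%")
```

Measured against the original 179 GB, either compressed file comes out at more than a 99% overall reduction.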
It also could be a precursor to analysis designed to work out
"primary" authors and other tasks where one wants to know who is
making large edits and who is making small, housekeeping edits.
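
A rough sketch of that kind of analysis follows. It assumes revisions are available as (author, text) pairs in chronological order, measures each edit by the number of characters it touches, and uses an arbitrary 20-character cutoff between "minor" and "major". The function names and the threshold are hypothetical choices, not something my code currently does.

```python
import difflib
from collections import defaultdict


def changed_chars(old: str, new: str) -> int:
    """Count characters touched by the edit from `old` to `new`."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Take the larger side so a replace is not counted twice.
            total += max(i2 - i1, j2 - j1)
    return total


def tally_by_author(revisions, minor_threshold=20):
    """Summarize edit sizes per author.

    `revisions` is an iterable of (author, text) pairs, oldest first.
    The first revision is compared against an empty page.
    """
    stats = defaultdict(lambda: {"major": 0, "minor": 0, "chars": 0})
    previous = ""
    for author, text in revisions:
        size = changed_chars(previous, text)
        kind = "minor" if size <= minor_threshold else "major"
        stats[author][kind] += 1
        stats[author]["chars"] += size
        previous = text
    return dict(stats)


if __name__ == "__main__":
    history = [
        ("Alice", "Saturn is the sixth planet from the Sun and the second largest."),
        ("Bob", "Saturn is the sixth planet from the Sun and the second-largest."),
    ]
    for author, summary in tally_by_author(history).items():
        print(author, summary)
```

A character count is a crude measure (a revert, for instance, looks like a large edit), so anything serious would need more careful heuristics.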
Obviously, as a compressor it is most successful with large pages that have
a large number of relatively minor revisions. For example, the enwiki
history of [[Saturn]] (current size 57 kB, 4741 revisions) sees a
99.1% size reduction. I suspect that the size reduction on large
wikis, like en or de, would be even greater than the 94% for ruwiki
since larger wikis tend to have larger pages and more revisions per
page.
The current version of my compressor averaged a little better than 250
revisions per second on ruwiki (about 12 hours total) on an
18-month-old desktop. However, as the CPU utilization was only 50-70%
of a full processing core most of the time, I suspect that my choice
to read from and write to an external hard drive may have been the
limiting factor. On a good machine, 400+ rev/s might be a plausible
number for the current code. Or in short, the overhead for figuring
out my edit syntax is relatively small compared to the generation time
for the current dumps (which I'm guessing is limited by communication
with the text data store).
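
The run-time figures above follow directly from the revision count. A small calculation, using the rounded 11.2 M revisions and treating 400 rev/s as the speculative figure it is:

```python
REVISIONS = 11.2e6  # ruwiki revision count quoted above (rounded)


def hours_at(revs_per_second: float) -> float:
    """Wall-clock hours needed to process all revisions at a steady rate."""
    return REVISIONS / revs_per_second / 3600.0


if __name__ == "__main__":
    print(f"at 250 rev/s: {hours_at(250):.1f} hours")  # observed rate
    print(f"at 400 rev/s: {hours_at(400):.1f} hours")  # guessed rate on better hardware
```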
My code has some quirks and known bugs, and I'd describe it as a
late-stage alpha version at the moment. It still needs considerable
work (not to mention documentation) before I would consider it to be
something ready for general use.
However, I wanted to know if this is a project of interest to
MediaWiki developers or other people. Placed in the dump chain, it
could substantially reduce the size of the human-readable dumps (at
the expense that one would need to process a series of edits to see
the full text of any specific revision). Or, used for different
purposes, it could help figure out major vs. minor editors, etc. If
this project is mostly just a curiosity for my own use, then I will
probably keep the code pretty crude. However, if other people are
interested in using something like this, then I am willing to put more
effort into developing something that is cleaner and more generally
usable.
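
To illustrate the trade-off mentioned above, where reading a specific revision means replaying the edits that lead up to it, here is a minimal sketch. It assumes each edit is a (start, end, replacement) tuple with offsets into the previous revision. That representation and the function names are hypothetical simplifications, not the actual format.

```python
import difflib


def make_edits(old: str, new: str):
    """Return (start, end, replacement) edits that turn `old` into `new`."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_edits(old: str, edits) -> str:
    """Apply edits (sorted by offset, non-overlapping) to `old`."""
    pieces = []
    cursor = 0
    for start, end, replacement in edits:
        pieces.append(old[cursor:start])  # unchanged text before the edit
        pieces.append(replacement)        # empty string for a pure deletion
        cursor = end
    pieces.append(old[cursor:])           # unchanged tail
    return "".join(pieces)


def reconstruct(first_text: str, edit_chain, index: int) -> str:
    """Rebuild revision `index` (0-based) by replaying the first `index` edit sets."""
    text = first_text
    for edits in edit_chain[:index]:
        text = apply_edits(text, edits)
    return text


if __name__ == "__main__":
    revisions = [
        "Saturn is a planet.",
        "Saturn is the sixth planet.",
        "Saturn is the sixth planet from the Sun.",
    ]
    chain = [make_edits(a, b) for a, b in zip(revisions, revisions[1:])]
    print(reconstruct(revisions[0], chain, 2))
```

The cost of reaching revision N grows with N, which is the expense referred to above; storing an occasional full-text snapshot would be one way to bound it, though that is only a suggestion here.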
So, I'd like to know whether there are people (besides myself) who
are interested in seeing the full history dumps expressed in an edit syntax
rather than the full-text syntax currently used.
-Robert Rohde