On 1/7/09 1:43 PM, Robert Rohde wrote:
Recently I've written some Python code to re-express the revision
history in an "edit syntax", using an XML-compatible notation for
changes, with expressions like:
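The exact notation isn't reproduced here, but one plausible shape for this kind of delta encoding can be sketched with Python's standard-library difflib. The tag names (`<copy>`, `<insert>`) and the helper functions below are illustrative assumptions, not the format described above:

```python
# Hypothetical sketch: encode each revision as edit operations relative
# to the previous revision, in an XML-compatible notation. Tag names
# are invented for illustration.
import difflib
import re
from xml.sax.saxutils import escape, unescape

def encode_delta(old, new):
    """Express `new` as copy/insert operations against `old`."""
    ops = []
    sm = difflib.SequenceMatcher(None, old, new)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == 'equal':
            # Unchanged text: reference a span of the previous revision.
            ops.append('<copy start="%d" len="%d"/>' % (i1, i2 - i1))
        elif tag in ('replace', 'insert'):
            # New text: carry it literally.
            ops.append('<insert>%s</insert>' % escape(new[j1:j2]))
        # 'delete' emits nothing: the old text is simply not copied.
    return ''.join(ops)

def decode_delta(old, encoded):
    """Reconstruct the new revision from the edit operations."""
    out = []
    pattern = r'<copy start="(\d+)" len="(\d+)"/>|<insert>(.*?)</insert>'
    for m in re.finditer(pattern, encoded, re.S):
        if m.group(1) is not None:
            start, length = int(m.group(1)), int(m.group(2))
            out.append(old[start:start + length])
        else:
            out.append(unescape(m.group(3)))
    return ''.join(out)

rev1 = "The quick brown fox"
rev2 = "The quick red fox jumps"
print(encode_delta(rev1, rev2))
```

Because consecutive revisions of a wiki page usually differ by only a few edits, storing deltas like this shrinks the full-history dump dramatically, at the cost that reconstructing one revision means replaying the chain of edits before it.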
The current version of my compressor averaged a little better than 250
revisions per second on ruwiki (about 12 hours total) on an
18-month-old desktop. However, as the CPU utilization was only 50-70%
of a full processing core most of the time, I suspect that my choice
to read from and write to an external hard drive may have been the
limiting factor. On a good machine, 400+ rev/s might be a plausible
number for the current code.
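As a quick sanity check on those figures, the quoted rate and runtime imply a total revision count on the order of ten million:

```python
# Back-of-the-envelope check of the throughput figures quoted above.
rate_rev_per_sec = 250   # a little better than this, per the message
hours = 12               # "about 12 hours total"
total_revisions = rate_rev_per_sec * hours * 3600
print(total_revisions)   # -> 10800000, i.e. ~10.8 million revisions
```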
It'd be good to compare this against the general-purpose bzip2 and 7zip
compressors.
However, I wanted to know if this is a project of
MediaWiki developers or other people. Placed in the dump chain, it
could substantially reduce the size of the human-readable dumps (at
the expense that one would need to process through a series of edits
to see the full text of any specific revision).
Definitely of interest! If you haven't already, I'd love to see some
documentation on the format on mediawiki.org, and it'd be great if we
can host the dev code in source control, under extensions or tools for
now, until we can integrate something directly into the export code.