On 1/7/09 1:43 PM, Robert Rohde wrote:
Recently I've written some Python code to re-express the revision history as an "edit syntax", using an XML-compatible notation for changes, with expressions like:
<replace>, <delete>, <insert>, etc.
Cool!
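(To give a concrete picture of a notation along those lines, here is a toy sketch built on Python's difflib. The tag names come from Robert's description above, but the attribute layout and the diffing approach are guesses for illustration only, not his actual code.)

    import difflib
    from xml.sax.saxutils import escape

    def edit_syntax(old_text, new_text):
        """Return an XML-ish fragment describing how old_text becomes new_text."""
        ops = []
        matcher = difflib.SequenceMatcher(None, old_text, new_text)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == 'replace':
                ops.append('<replace pos="%d" len="%d">%s</replace>'
                           % (i1, i2 - i1, escape(new_text[j1:j2])))
            elif tag == 'delete':
                ops.append('<delete pos="%d" len="%d"/>' % (i1, i2 - i1))
            elif tag == 'insert':
                ops.append('<insert pos="%d">%s</insert>' % (i1, escape(new_text[j1:j2])))
            # 'equal' spans are omitted -- that is where the space saving comes from
        return '\n'.join(ops)

    print(edit_syntax('Hello wiki world', 'Hello wide wiki world!'))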
The current version of my compressor averaged a little better than 250 revisions per second on ruwiki (about 12 hours total) on an 18-month-old desktop. However, since CPU utilization was only 50-70% of one core most of the time, I suspect that my choice to read from and write to an external hard drive may have been the limiting factor. On a good machine, 400+ rev/s might be a plausible number for the current code.
It'd be good to compare this against the general-purpose bzip2 and 7zip LZMA compression...
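(One quick way to get a baseline for that comparison is to run the same chunk of revision text through Python's bz2 and lzma modules and compare output sizes; the file name below is just a placeholder.)

    import bz2
    import lzma

    # Placeholder file name; substitute a real slice of the revision dump.
    with open('ruwiki-sample.xml', 'rb') as f:
        data = f.read()

    print('raw   bytes:', len(data))
    print('bzip2 bytes:', len(bz2.compress(data, compresslevel=9)))
    print('lzma  bytes:', len(lzma.compress(data, preset=9)))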
However, I wanted to know whether this is a project of interest to MediaWiki developers or other people. Placed in the dump chain, it could substantially reduce the size of the human-readable dumps (at the expense that one would need to process through a series of edits to see the full text of any specific revision).
Definitely of interest! If you haven't already, I'd love to see some documentation on the format on mediawiki.org, and it'd be great if we can host the dev code in source control, under extensions or tools for now, until we can integrate something directly into the export code.
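(To make the trade-off Robert describes concrete: reading revision N out of such a dump means replaying every stored edit up to N on top of the base text. The tuple-based edit records and helper below are purely illustrative, not the actual format.)

    def apply_edit(text, edit):
        """Apply one (op, pos, length, payload) record to a string."""
        op, pos, length, payload = edit
        if op == 'replace':
            return text[:pos] + payload + text[pos + length:]
        if op == 'delete':
            return text[:pos] + text[pos + length:]
        if op == 'insert':
            return text[:pos] + payload + text[pos:]
        raise ValueError('unknown op: %r' % op)

    def reconstruct(base_text, edits, n):
        """Replay the first n edits to recover revision n's full text."""
        text = base_text
        for edit in edits[:n]:
            text = apply_edit(text, edit)
        return text

    edits = [('insert', 5, 0, ' brave'), ('replace', 0, 5, 'Howdy')]
    print(reconstruct('Hello world', edits, 2))  # -> "Howdy brave world"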
-- brion