I've uploaded my demonstration code to:
The three files are:
EditSyntax.py - the main file, providing the functions for
re-expressing the revision history in my "edit syntax"
ConvertToEditSyntax.py - a utility to perform compression
ConvertFromEditSyntax.py - a utility to perform decompression
Both utility files can be run from the command line, e.g.:
ConvertToEditSyntax.py [-v] input_file output_file
ConvertFromEditSyntax.py [-v] input_file output_file
"-v" is an optional "verbose mode" flag that prints rolling feedback
on the program's progress.
ConvertToEditSyntax expects as input either a full history database
dump (i.e. "pages-meta-history.xml") or the output from Special:Export
with revisions included.
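For reference, both accepted inputs use MediaWiki's export XML schema: <page> elements containing <revision> elements, with the wikitext inside <text>. A minimal streaming reader, assuming that schema and ignoring the version-specific XML namespace, might look like the sketch below (the function name and structure are illustrative, not taken from EditSyntax.py):

```python
import xml.etree.ElementTree as ET

def iter_revisions(path):
    """Stream (page_title, revision_text) pairs from a MediaWiki XML
    dump without loading the whole file into memory."""
    title = None
    for event, elem in ET.iterparse(path, events=('end',)):
        tag = elem.tag.rsplit('}', 1)[-1]  # drop any XML namespace prefix
        if tag == 'title':
            # <title> precedes the page's <revision> elements in dumps
            title = elem.text
        elif tag == 'revision':
            text_elem = next(
                (c for c in elem if c.tag.rsplit('}', 1)[-1] == 'text'),
                None)
            text = text_elem.text or '' if text_elem is not None else ''
            yield title, text
            elem.clear()  # free memory as we go
        elif tag == 'page':
            elem.clear()
```

This is roughly the shape of reader a full-history dump requires, since pages-meta-history.xml is far too large to parse into a single in-memory tree.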
A round-trip (ConvertTo followed by ConvertFrom) should produce output
that is 100% identical to the original.
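The real edit syntax lives in EditSyntax.py, but the round-trip property can be illustrated with a toy delta scheme (hypothetical names and format, not the actual one): store the first revision whole, express each later revision as copy/insert operations against its predecessor, and replay those operations to decompress.

```python
import difflib

def compress(revisions):
    """Toy sketch: first revision stored whole, each later revision as
    a list of ('copy', i1, i2) / ('insert', text) ops vs. its parent."""
    out = [revisions[0]]
    for prev, curr in zip(revisions, revisions[1:]):
        ops = []
        sm = difflib.SequenceMatcher(a=prev, b=curr, autojunk=False)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == 'equal':
                ops.append(('copy', i1, i2))       # reuse prev[i1:i2]
            else:
                ops.append(('insert', curr[j1:j2]))  # literal new text
        out.append(ops)
    return out

def decompress(compressed):
    """Replay the ops to rebuild every revision exactly."""
    revisions = [compressed[0]]
    for ops in compressed[1:]:
        prev = revisions[-1]
        parts = [prev[op[1]:op[2]] if op[0] == 'copy' else op[1]
                 for op in ops]
        revisions.append(''.join(parts))
    return revisions
```

Because the opcodes from SequenceMatcher cover every character of the new revision, `decompress(compress(revs))` reproduces the input byte-for-byte, which is the same guarantee the real converters aim for.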
For simple demonstration purposes, a smallish wiki (one with a 7z dump
size of 5-10 MB) can provide a few hundred thousand revisions and
takes a couple of minutes to process. I've been partial to cowiki
during testing, for no particular reason.
One can also use Special:Export to fetch the histories of large pages
for testing. Large pages from enwiki might be a good place to start,
since the content is easy to understand.
My largest test case has been ruwiki (12M revisions), which currently
takes 10 hours and was described in the original email in this thread.
Since that email I've slightly improved both the execution speed and
the compression ratio.
I've provided some internal documentation, but not much so far, so
feel free to ask questions if you have them.
On Fri, Jan 16, 2009 at 11:21 AM, Brion Vibber <brion(a)wikimedia.org> wrote:
I've set Robert up with an SVN account (rarohde) to put up the
proof-of-concept converter code.
Wikitech-l mailing list