I've uploaded my demonstration code to:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/editsyntax/
The three files are:
EditSyntax.py - the main file, providing the functions for re-expressing the revision history in my "edit syntax".
ConvertToEditSyntax.py - a utility to perform compressions
ConvertFromEditSyntax.py - a utility to perform decompressions
Both utility files can be run from the command line, i.e.:
ConvertToEditSyntax.py [-v] input_file output_file
ConvertFromEditSyntax.py [-v] input_file output_file
"-v" is an optional flag for "verbose mode" which gives rolling feedback on the program's progress.
ConvertToEditSyntax expects as input either a full history database dump (i.e. "pages-meta-history.xml") or the output from Special:Export with revisions included.
A round-trip via ConvertTo followed by ConvertFrom should produce a file that is 100% identical to the original.
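As a rough sketch of how one might check the round-trip claim, the snippet below runs the two utilities with the command-line form given above and compares SHA-256 digests of the original and restored files. The helper names (file_digest, round_trip_is_identical) are hypothetical and not part of the uploaded code. It also assumes both scripts sit in the current directory and run under the same Python interpreter as the check itself, which may not hold for your setup.

```python
import hashlib
import subprocess
import sys


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip_is_identical(original, compressed, restored):
    """Compress, decompress, then compare the restored file to the original.

    Assumption: ConvertToEditSyntax.py and ConvertFromEditSyntax.py are in
    the current directory and work with the interpreter running this check.
    """
    subprocess.run(
        [sys.executable, "ConvertToEditSyntax.py", original, compressed],
        check=True,
    )
    subprocess.run(
        [sys.executable, "ConvertFromEditSyntax.py", compressed, restored],
        check=True,
    )
    return file_digest(original) == file_digest(restored)
```

Hashing in chunks keeps memory use flat, which matters for full-history dumps that run to many gigabytes.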
For simple demonstration purposes, a smallish wiki (something with a 7z size of 5-10 MB) can provide a few hundred thousand revisions and take a couple of minutes to execute. I've been partial to cowiki during testing, for no particular reason.
One can also use Special:Export to get histories of large pages to test. Using large pages from enwiki might be a good place to start because the content is easy to understand.
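To get a feel for how large a test input is before converting it, one could count its pages and revisions first. The following is only a sketch: count_revisions is a hypothetical helper, not part of the uploaded code, and it assumes the input is uncompressed export XML built from the usual <page> and <revision> elements (a 7z dump would need to be extracted first).

```python
import xml.etree.ElementTree as ET


def count_revisions(xml_path):
    """Return (pages, revisions) for a MediaWiki XML export file."""
    pages = 0
    revisions = 0
    # Stream the file so that large histories need not fit in memory.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        # Drop the XML namespace prefix, which varies by export version.
        name = elem.tag.rsplit("}", 1)[-1]
        if name == "revision":
            revisions += 1
            elem.clear()  # free the revision text once it is counted
        elif name == "page":
            pages += 1
            elem.clear()
    return pages, revisions
```

Calling count_revisions("export.xml") on a saved Special:Export file (the file name here is just an example) returns the two counts as a tuple.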
My largest test case has been ruwiki (12M revisions), which currently takes 10 hours to process and was described in the original email in this thread. Since that email I've slightly improved both the execution speed and the compression ratio.
I've provided a little bit of internal documentation, but not a lot so far, so obviously feel free to ask questions if you have them.
-Robert Rohde
On Fri, Jan 16, 2009 at 11:21 AM, Brion Vibber brion@wikimedia.org wrote:
I've set Robert up with a SVN account (rarohde) to put up the proof of concept converter code.
-- brion