[Resending without the full list of articles, which caused the message to be bounced into moderation.]
Here are the results of a quick test I ran over the weekend, comparing a compressed excerpt from simple.wikipedia.org in mediawiki markup to the compressed parsoid representation of the same articles. The list of articles is attached to this message. [Not any more.]
For the base case I used the processing pipeline for OLPC's "Wikipedia activity" (source code at github.com/cscott/wikiserver). It begins with a hand-written "portal page", then grabs all articles within two links of the portal page. The original markup was taken from the simplewiki-20130112-pages-articles.xml dump. Templates were then fully expanded, and just the selected articles were written. Articles are separated by the character 0x01, a newline, the title of the article, a newline, the length of the article in bytes, a newline, and the character 0x02 and a newline.
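To make the record layout concrete, here is a minimal Python sketch of a writer and reader for it. This is not the wikiserver code; the names pack_article and iter_articles are made up for illustration. It assumes that the length is written as decimal ASCII, that titles are UTF-8, and that each article body follows its header directly with nothing between the end of one body and the next 0x01 marker.

```python
def pack_article(title: str, body: bytes) -> bytes:
    """Build one record: 0x01, title, byte length, 0x02 (each followed by a newline), then the body."""
    header = b"\x01\n" + title.encode("utf-8") + b"\n"
    header += str(len(body)).encode("ascii") + b"\n" + b"\x02\n"
    return header + body


def iter_articles(data: bytes):
    """Yield (title, body) pairs from a buffer of concatenated records."""
    pos = 0
    while pos < len(data):
        if data[pos:pos + 2] != b"\x01\n":
            raise ValueError(f"expected 0x01 marker at offset {pos}")
        pos += 2

        # Title line.
        end = data.index(b"\n", pos)
        title = data[pos:end].decode("utf-8")
        pos = end + 1

        # Length line (assumed to be decimal ASCII).
        end = data.index(b"\n", pos)
        length = int(data[pos:end].decode("ascii"))
        pos = end + 1

        if data[pos:pos + 2] != b"\x02\n":
            raise ValueError(f"expected 0x02 marker at offset {pos}")
        pos += 2

        # The body is exactly `length` bytes, so it may itself contain newlines.
        body = data[pos:pos + length]
        if len(body) != length:
            raise ValueError(f"truncated body for article {title!r}")
        pos += length
        yield title, body
```

Because the length is given up front, a reader can also skip over bodies without scanning them, which is handy for building a title index.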
For comparison, I took the list of articles included in the dump and wrote a small script to fetch them from parsoid, using the HEAD of the master branch from this weekend (2013-02-17, roughly). I wrote the full parsoid HTML document (including top-level <html> tag, <head>, <base href>, and <body> but not including a <!DOCTYPE>) to a file, separating articles with the title of the article, a newline, the length of the article in bytes, and a newline.
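For illustration, the output side of such a script might look like the Python sketch below. This is not the script I actually ran. The fetch_html argument is a hypothetical callable that returns the Parsoid HTML for a title (how you reach Parsoid is left to the caller), and the sketch assumes the length counts UTF-8 bytes and that the document follows its two header lines directly.

```python
import io
from typing import BinaryIO, Callable, Iterable


def write_parsoid_dump(
    titles: Iterable[str],
    fetch_html: Callable[[str], str],  # hypothetical: returns the HTML document for a title
    out: BinaryIO,
) -> None:
    """Write each article as: title, newline, byte length, newline, HTML document."""
    for title in titles:
        html = fetch_html(title).encode("utf-8")
        out.write(title.encode("utf-8") + b"\n")
        out.write(str(len(html)).encode("ascii") + b"\n")
        out.write(html)


# Usage with a stand-in fetcher instead of a live Parsoid instance.
buf = io.BytesIO()
write_parsoid_dump(["Apple"], lambda t: f"<html><body>{t}</body></html>", buf)
```

Passing the fetcher in as a callable keeps the file format logic separate from the network code, so the writer can be exercised without a running Parsoid.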
Results, with and without compression:
# of articles: 3640

                  Mediawiki markup   Parsoid markup
uncompressed      18M                199M
gzip -9           6.4M               26M
bzip2 -9          4.7M               17M
lzma -9           4.4M               15M
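For anyone who wants to reproduce this kind of measurement or check the ratios, here is a rough Python sketch. The names compressed_sizes, REPORTED_MB and expansion_ratios are made up for illustration. The lzma line assumes the legacy .lzma container (FORMAT_ALONE) is the right match for the lzma command, and sizes from the Python modules may differ slightly from the command-line tools.

```python
import bz2
import gzip
import lzma


def compressed_sizes(data: bytes, level: int = 9) -> dict:
    """Return the size in bytes of `data` raw and under each compressor."""
    return {
        "uncompressed": len(data),
        "gzip": len(gzip.compress(data, compresslevel=level)),
        "bzip2": len(bz2.compress(data, compresslevel=level)),
        # Assumption: FORMAT_ALONE (.lzma) corresponds to what `lzma -9` writes.
        "lzma": len(lzma.compress(data, format=lzma.FORMAT_ALONE, preset=level)),
    }


# Sizes in megabytes as reported above: (mediawiki markup, parsoid markup).
REPORTED_MB = {
    "uncompressed": (18, 199),
    "gzip -9": (6.4, 26),
    "bzip2 -9": (4.7, 17),
    "lzma -9": (4.4, 15),
}


def expansion_ratios(sizes: dict) -> dict:
    """Parsoid size divided by mediawiki size for each row."""
    return {name: parsoid / mediawiki for name, (mediawiki, parsoid) in sizes.items()}
```

Calling expansion_ratios(REPORTED_MB) gives the per-row expansion factor. Since the reported sizes are already rounded, the ratios are only good to about two significant figures.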
So there's currently about an 11x expansion in the uncompressed size (199M vs. 18M), but only a 3.4-4x expansion with compression.
--scott