[Resending without the full list of articles, which caused the message to be bounced into moderation.]

Here are the results of a quick test I ran over the weekend, comparing the compressed size of an excerpt of simple.wikipedia.org in MediaWiki markup against the compressed size of the Parsoid representation of the same articles.  The list of articles is attached to this message. [Not any more.]

For the base case I used the processing pipeline for the OLPC's "Wikipedia activity" (source code at github.com/cscott/wikiserver).  It begins with a hand-written "portal page", then grabs all articles within two links of that page.  The original markup was taken from the simplewiki-20130112-pages-articles.xml dump.  Templates were then fully expanded, and only the selected articles were written out.  Articles are separated by the character 0x01, a newline, the title of the article, a newline, the length of the article in bytes, a newline, the character 0x02, and a final newline.
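For concreteness, here is a minimal sketch of writing that record format.  The function and variable names are my own invention; the actual code in github.com/cscott/wikiserver may differ in detail.

    # Sketch only: writes one 0x01/0x02-delimited record per article.
    def write_record(out, title, text):
        body = text.encode('utf-8')
        out.write(b'\x01\n')                        # 0x01 marker
        out.write(title.encode('utf-8') + b'\n')    # article title
        out.write(str(len(body)).encode() + b'\n')  # length in bytes
        out.write(b'\x02\n')                        # 0x02 marker ends the header
        out.write(body)                             # expanded wikitext follows

    def write_all(path, articles):
        # articles: iterable of (title, expanded_wikitext) pairs (assumed)
        with open(path, 'wb') as out:
            for title, text in articles:
                write_record(out, title, text)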

For comparison, I took the list of articles included in the dump and wrote a small script to fetch them from Parsoid, using the HEAD of the master branch as of this weekend (roughly 2013-02-17).  I wrote the full Parsoid HTML document (including the top-level <html> tag, <head>, <base href>, and <body>, but not a <!DOCTYPE>) to a file, separating articles with the title of the article, a newline, the length of the article in bytes, and a newline.
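A rough sketch of what such a fetch script could look like follows.  The Parsoid URL scheme here (a local service answering GET /<prefix>/<title>) and the title normalization are assumptions on my part, not taken from the actual script.

    # Hedged sketch: assumes a local Parsoid service returning the full
    # HTML document for each title; the real endpoint may differ.
    import urllib.request, urllib.parse

    PARSOID = 'http://localhost:8000/simplewiki/'      # assumed service URL

    def fetch_and_write(path, titles):
        with open(path, 'wb') as out:
            for title in titles:                        # assumed list of titles
                url = PARSOID + urllib.parse.quote(title.replace(' ', '_'))
                html = urllib.request.urlopen(url).read()
                out.write(title.encode('utf-8') + b'\n')    # article title
                out.write(str(len(html)).encode() + b'\n')  # length in bytes
                out.write(html)                             # full Parsoid HTML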

Results, with and without compression:

# of articles: 3640
MediaWiki markup, uncompressed: 18M
Parsoid markup, uncompressed: 199M

MediaWiki markup, gzip -9 compressed: 6.4M
Parsoid markup, gzip -9 compressed: 26M

MediaWiki markup, bzip2 -9 compressed: 4.7M
Parsoid markup, bzip2 -9 compressed: 17M

MediaWiki markup, lzma -9 compressed: 4.4M
Parsoid markup, lzma -9 compressed: 15M

So there's currently roughly an 11x expansion in the uncompressed size, but only a 3-4x expansion once the output is compressed.
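The expansion factors work out as follows (quick arithmetic from the sizes above):

    # Expansion factors implied by the numbers above.
    sizes = {                     # (mediawiki, parsoid) sizes, in MB
        'uncompressed': (18.0, 199.0),
        'gzip -9':      (6.4,  26.0),
        'bzip2 -9':     (4.7,  17.0),
        'lzma -9':      (4.4,  15.0),
    }
    for name, (mw, psd) in sizes.items():
        print('%-12s %4.1fx' % (name, psd / mw))
    # uncompressed ~11.1x, gzip ~4.1x, bzip2 ~3.6x, lzma ~3.4x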
  --scott

--
                         ( http://cscott.net/ )