[Resending without the full list of articles, which caused the message to be bounced into moderation.]
Here are the results of a quick test I ran over the weekend, comparing a compressed excerpt of simple.wikipedia.org in MediaWiki markup to the compressed Parsoid representation of the same articles. The list of articles is attached to this message. [Not any more.]
For the base case I used the processing pipeline for the OLPC's "Wikipedia activity" (source code at github.com/cscott/wikiserver). It begins with a hand-written "portal page", then grabs all articles within two links of the portal page. The original markup was taken from the simplewiki-20130112-pages-articles.xml dump. Templates were then fully expanded, and only the selected articles were written out. Articles are separated by the character 0x01, a newline, the title of the article, a newline, the length of the article in bytes, a newline, the character 0x02, and a newline.
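To make the record format concrete, here is a minimal sketch (mine, not the actual wikiserver code; it assumes `out` is a file opened in binary mode) of how each article record is laid out:

    # Record layout described above: 0x01, newline, title, newline,
    # length of the article body in bytes, newline, 0x02, newline,
    # then the fully expanded wikitext itself.
    def write_article(out, title, wikitext):
        body = wikitext.encode('utf-8')
        out.write(b'\x01\n')
        out.write(title.encode('utf-8') + b'\n')
        out.write(str(len(body)).encode('ascii') + b'\n')
        out.write(b'\x02\n')
        out.write(body)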
For comparison, I took the list of articles included in the dump and wrote a small script to fetch them from Parsoid, using the HEAD of the master branch from this weekend (2013-02-17, roughly). I wrote the full Parsoid HTML document (including the top-level <html> tag, <head>, <base href>, and <body>, but not a <!DOCTYPE>) to a file, separating articles with the title of the article, a newline, the length of the article in bytes, and a newline.
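The fetch-and-concatenate step looked roughly like the sketch below; the URL scheme (a local Parsoid service at http://localhost:8000/simplewiki/<title>) is an assumption and depends on how your Parsoid instance is configured:

    import urllib.parse
    import urllib.request

    PARSOID_BASE = 'http://localhost:8000/simplewiki/'  # assumed local Parsoid endpoint

    def fetch_parsoid_html(title):
        # Page titles use underscores; percent-encode the rest.
        url = PARSOID_BASE + urllib.parse.quote(title.replace(' ', '_'))
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def write_corpus(out, titles):
        # Records: title, newline, length in bytes, newline, then the HTML.
        for title in titles:
            html = fetch_parsoid_html(title)
            out.write(title.encode('utf-8') + b'\n')
            out.write(str(len(html)).encode('ascii') + b'\n')
            out.write(html)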
Results, with and without compression:
# of articles: 3640

                         MediaWiki markup   Parsoid markup
    uncompressed               18M              199M
    gzip -9 compressed         6.4M              26M
    bzip2 -9 compressed        4.7M              17M
    lzma -9 compressed         4.4M              15M
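(For reference, the numbers above can be reproduced with the command-line tools, or with a short script along these lines; the file names are placeholders for the two concatenated corpora.)

    import bz2
    import gzip
    import lzma

    def compressed_sizes(path):
        data = open(path, 'rb').read()
        return {
            'uncompressed': len(data),
            'gzip -9':  len(gzip.compress(data, compresslevel=9)),
            'bzip2 -9': len(bz2.compress(data, compresslevel=9)),
            'lzma -9':  len(lzma.compress(data, preset=9)),
        }

    # Placeholder file names for the two concatenated corpora described above.
    for name in ('simplewiki-wikitext.dat', 'simplewiki-parsoid.dat'):
        print(name, compressed_sizes(name))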
So there's currently a 10x expansion in the uncompressed size, but only 3-4x expansion with compression. --scott
On 02/19/2013 03:52 PM, C. Scott Ananian wrote:
> So there's currently a 10x expansion in the uncompressed size, but only 3-4x expansion with compression.
My last test, run after https://gerrit.wikimedia.org/r/#/c/49185/ was merged, showed a gzip-compressed size factor of about 2 for a large article:
259K obama-parsoid-old.html.gz
255K obama-parsoid-adaptive-attribute-quoting.html.gz
135K obama-PHP.html.gz
We currently store all round-trip information (plus some debug info) in the DOM, but plan to move most of this information out of it. The information is private in any case, so there is no reason to send it out along with the DOM. We might keep some UID attributes to aid node identification, but there is also the possibility of using subtree hashes, as in the XyDiff algorithm, to help with that.
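As a rough illustration of the XyDiff-style idea (this is not Parsoid code, just a sketch over xml.etree elements): each node's hash folds in its tag, its text, and the hashes of its children, so identical subtrees in two DOMs get the same hash and can be matched without per-node UIDs.

    import hashlib
    import xml.etree.ElementTree as ET

    def subtree_hash(node):
        # Hash of a node = hash of its tag, text, and all child subtree hashes.
        h = hashlib.sha1()
        h.update(node.tag.encode('utf-8'))
        h.update((node.text or '').encode('utf-8'))
        for child in node:
            h.update(subtree_hash(child))
            h.update((child.tail or '').encode('utf-8'))
        return h.digest()

    # e.g. subtree_hash(ET.fromstring('<body><p>Hi</p></body>')).hex()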
In the end, the resulting DOM will likely still be slightly larger than the PHP parser's output as it contains more information, in particular about templates.
Gabriel
It would be interesting to determine whether the 'Barack Obama' article is an outlier. It could be that the Simple English Wikipedia has a larger ratio of text to markup, and thus its Parsoid output is comparatively chunkier, whereas "Barack Obama" may already have a large amount of wikitext markup, so its Parsoid output isn't as (comparatively) large. If I get some free time I'll run some experiments to determine what's going on. --scott