[Resending without the full list of articles, which caused the message to be bounced into moderation.]
Here are the results of a quick test I ran over the weekend, comparing a compressed excerpt of simple.wikipedia.org in MediaWiki markup to the compressed Parsoid representation of the same articles. The list of articles is attached to this message. [Not any more.]
For the base case I used the processing pipeline for the OLPC's "Wikipedia activity" (source code at github.com/cscott/wikiserver). It begins with a hand-written "portal page", then grabs all articles within two links of the portal page. The original markup was taken from the simplewiki-20130112-pages-articles.xml dump. Templates were then fully expanded, and only the selected articles were written out. Articles are separated by the character 0x01, a newline, the title of the article, a newline, the length of the article in bytes, a newline, the character 0x02, and a newline.
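To make the record format concrete, here is a minimal sketch (mine, not the actual wikiserver code; it assumes `out` is a file opened in binary mode) of how each article record is laid out:

    # Record layout described above: 0x01, newline, title, newline,
    # length of the article body in bytes, newline, 0x02, newline,
    # then the fully expanded wikitext itself.
    def write_article(out, title, wikitext):
        body = wikitext.encode('utf-8')
        out.write(b'\x01\n')
        out.write(title.encode('utf-8') + b'\n')
        out.write(str(len(body)).encode('ascii') + b'\n')
        out.write(b'\x02\n')
        out.write(body)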
For comparison, I took the list of articles included in the dump and wrote a small script to fetch them from Parsoid, using the HEAD of the master branch from this weekend (2013-02-17, roughly). I wrote the full Parsoid HTML document (including the top-level <html> tag, <head>, <base href>, and <body>, but not a <!DOCTYPE>) to a file, separating articles with the title of the article, a newline, the length of the article in bytes, and a newline.
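The fetch-and-concatenate step looked roughly like the sketch below; the URL scheme (a local Parsoid service at http://localhost:8000/simplewiki/<title>) is an assumption and depends on how your Parsoid instance is configured:

    import urllib.parse
    import urllib.request

    PARSOID_BASE = 'http://localhost:8000/simplewiki/'  # assumed local Parsoid endpoint

    def fetch_parsoid_html(title):
        # Page titles use underscores; percent-encode the rest.
        url = PARSOID_BASE + urllib.parse.quote(title.replace(' ', '_'))
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def write_corpus(out, titles):
        # Records: title, newline, length in bytes, newline, then the HTML.
        for title in titles:
            html = fetch_parsoid_html(title)
            out.write(title.encode('utf-8') + b'\n')
            out.write(str(len(html)).encode('ascii') + b'\n')
            out.write(html)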
Results, with and without compression:
# of articles: 3640

                         MediaWiki markup   Parsoid markup
    uncompressed               18M              199M
    gzip -9 compressed         6.4M              26M
    bzip2 -9 compressed        4.7M              17M
    lzma -9 compressed         4.4M              15M
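(For reference, the numbers above can be reproduced with the command-line tools, or with a short script along these lines; the file names are placeholders for the two concatenated corpora.)

    import bz2
    import gzip
    import lzma

    def compressed_sizes(path):
        data = open(path, 'rb').read()
        return {
            'uncompressed': len(data),
            'gzip -9':  len(gzip.compress(data, compresslevel=9)),
            'bzip2 -9': len(bz2.compress(data, compresslevel=9)),
            'lzma -9':  len(lzma.compress(data, preset=9)),
        }

    # Placeholder file names for the two concatenated corpora described above.
    for name in ('simplewiki-wikitext.dat', 'simplewiki-parsoid.dat'):
        print(name, compressed_sizes(name))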
So there's currently a 10x expansion in the uncompressed size, but only 3-4x expansion with compression. --scott
On 02/19/2013 03:52 PM, C. Scott Ananian wrote:
> So there's currently a 10x expansion in the uncompressed size, but only 3-4x expansion with compression.
My last test, run after https://gerrit.wikimedia.org/r/#/c/49185/ was merged, showed a gzip-compressed size factor of about 2 for a large article:
259K obama-parsoid-old.html.gz
255K obama-parsoid-adaptive-attribute-quoting.html.gz
135K obama-PHP.html.gz
We currently store all round-trip information (plus some debug info) in the DOM, but plan to move most of this information out of it. The information is private in any case, so there is no reason to send it out along with the DOM. We might keep some UID attributes to aid node identification, but there is also the possibility of using subtree hashes, as in the XyDiff algorithm, to help with that.
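As a rough illustration of the XyDiff-style idea (this is not Parsoid code, just a sketch over xml.etree elements): each node's hash folds in its tag, its text, and the hashes of its children, so identical subtrees in two DOMs get the same hash and can be matched without per-node UIDs.

    import hashlib
    import xml.etree.ElementTree as ET

    def subtree_hash(node):
        # Hash of a node = hash of its tag, text, and all child subtree hashes.
        h = hashlib.sha1()
        h.update(node.tag.encode('utf-8'))
        h.update((node.text or '').encode('utf-8'))
        for child in node:
            h.update(subtree_hash(child))
            h.update((child.tail or '').encode('utf-8'))
        return h.digest()

    # e.g. subtree_hash(ET.fromstring('<body><p>Hi</p></body>')).hex()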
In the end, the resulting DOM will likely still be slightly larger than the PHP parser's output as it contains more information, in particular about templates.
Gabriel
It would be interesting to determine whether the 'Barack Obama' article is an outlier. It could be that the Simple English Wikipedia has a larger ratio of text to markup, and thus its Parsoid output is comparatively chunkier, whereas "Barack Obama" may already have a large amount of wikitext markup, so its Parsoid output isn't as (comparatively) large. If I get some free time I'll run some experiments to determine what's going on. --scott