CPRT feasibility - Wikitech-l

20 Aug 2009

I am looking into the feasibility of writing a comprehensive parser regression test
(CPRT). Before writing code, I thought I would try to get some idea of how well such a
tool would perform and what gotchas might pop up. An easy first step is to run dumpHTML
and capture some data and statistics.

I tried to run the version of dumpHTML in r54724, but it failed. So, I went back to 1.14
and ran that version against a small personal wiki database I have. I did this to get an
idea of what structures dumpHTML produces and also to get some performance data with
which to do projections of runtime/resource usage.

I ran dumpHTML twice using the same MW version and same database. I then diff'd the
two directories produced. One would expect no differences, but that expectation is wrong.
I got a bunch of diffs of the following form (I have put a newline between the two file
names to shorten the line length):

diff -r HTML_Dump/articles/d/n/e/User~Dnessett_Bref_Examples_Example1_Chapter_1_4083.html

HTML_Dump2/articles/d/n/e/User~Dnessett_Bref_Examples_Example1_Chapter_1_4083.html
77,78c77,78
< Post-expand include size: 16145/2097152 bytes
< Template argument size: 12139/2097152 bytes
---
> Post-expand include size: 16235/2097152 bytes
> Template argument size: 12151/2097152 bytes

I looked at one of the html files to see where these differences appear. They occur in an
html comment:

<!-- 
NewPP limit report
Preprocessor node count: 1891/1000000
Post-expand include size: 16145/2097152 bytes
Template argument size: 12139/2097152 bytes
Expensive parser function count: 0/100
-->

Does anyone have an idea of what this is for? Is there any way to configure MW so it
isn't produced?
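
In the meantime, one possible workaround is to strip the report from each file before
comparing. The following is only a minimal sketch in Python, assuming the comment always
begins with "NewPP limit report" as in the sample above. The names LIMIT_REPORT,
strip_limit_report and same_after_stripping are hypothetical helpers of my own, not part
of MediaWiki or dumpHTML.

```python
import re

# Assumption: the report is a single HTML comment whose text starts with
# "NewPP limit report", as in the sample shown above.
LIMIT_REPORT = re.compile(r"<!--\s*NewPP limit report.*?-->", re.DOTALL)


def strip_limit_report(html: str) -> str:
    """Return html with any NewPP limit report comment removed."""
    return LIMIT_REPORT.sub("", html)


def same_after_stripping(html_a: str, html_b: str) -> bool:
    """Compare two dumped pages, ignoring the limit report comment."""
    return strip_limit_report(html_a) == strip_limit_report(html_b)
```

Applied to each pair of files from the two dump directories, this would flag only the
differences that lie outside the report.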

I will post some performance data later.

Dan