I am looking into the feasibility of writing a comprehensive parser regression test (CPRT). Before writing code, I thought I would try to get some idea of how well such a tool would perform and what gotchas might pop up. An easy first step is to run dumpHTML and capture some data and statistics.
I tried to run the version of dumpHTML in r54724, but it failed. So I went back to MediaWiki 1.14 and ran that version against a small personal wiki database I have. I did this to get an idea of what structures dumpHTML produces, and also to gather some performance data from which to project runtime and resource usage.
I ran dumpHTML twice using the same MW version and the same database, then diff'd the two directories produced. One would expect no differences, but that expectation turned out to be wrong. I got a bunch of diffs of the following form (I have put a newline between the two file names to shorten the line length):
diff -r HTML_Dump/articles/d/n/e/User~Dnessett_Bref_Examples_Example1_Chapter_1_4083.html
HTML_Dump2/articles/d/n/e/User~Dnessett_Bref_Examples_Example1_Chapter_1_4083.html
77,78c77,78
< Post-expand include size: 16145/2097152 bytes
< Template argument size: 12139/2097152 bytes
---
> Post-expand include size: 16235/2097152 bytes
> Template argument size: 12151/2097152 bytes
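For reference, the procedure was simply two dump runs against the same, unchanged database, followed by a recursive diff. The exact path to dumpHTML.php and its options vary between versions (the -d destination flag shown here is from memory), so treat this as a sketch rather than a recipe:

# run the HTML dump twice against the same, unchanged database
php dumpHTML.php -d HTML_Dump
php dumpHTML.php -d HTML_Dump2

# recursively compare the two output trees; ideally this reports nothing
diff -r HTML_Dump HTML_Dump2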
I looked at one of the HTML files to see where these differences appear. They occur in an HTML comment:
<!-- NewPP limit report
Preprocessor node count: 1891/1000000
Post-expand include size: 16145/2097152 bytes
Template argument size: 12139/2097152 bytes
Expensive parser function count: 0/100
-->
Does anyone have an idea of what this report is for? Is there any way to configure MW so it isn't produced?
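If there turns out to be no clean way to suppress the report, a comparison tool could simply filter the comment out before diffing. A rough bash sketch, assuming the report always starts with "<!-- NewPP limit report" and its closing "-->" sits on a line of its own (as in the example above); the file names are only placeholders:

# hypothetical helper: delete the NewPP limit report comment so that
# only real differences in the parser output show up in the diff
strip_limit_report() {
    sed '/<!-- NewPP limit report/,/-->/d' "$1"
}

diff <(strip_limit_report HTML_Dump/articles/d/n/e/SomePage.html) \
     <(strip_limit_report HTML_Dump2/articles/d/n/e/SomePage.html)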
I will post some performance data later.
Dan