I am looking into the feasibility of writing a comprehensive parser regression test (CPRT). Before writing code, I thought I would try to get some idea of how well such a tool would perform and what gotchas might pop up. An easy first step is to run dumpHTML and capture some data and statistics.
I tried to run the version of dumpHTML in r54724, but it failed. So, I went back to 1.14 and ran that version against a small personal wiki database I have. I did this to get an idea of what structures dumpHTML produces and also to get some performance data with which to do projections of runtime/resource usage.
I ran dumpHTML twice using the same MW version and same database. I then diff'd the two directories produced. One would expect no differences, but that expectation is wrong. I got a bunch of diffs of the following form (I have put a newline between the two file names to shorten the line length):
diff -r HTML_Dump/articles/d/n/e/User~Dnessett_Bref_Examples_Example1_Chapter_1_4083.html
HTML_Dump2/articles/d/n/e/User~Dnessett_Bref_Examples_Example1_Chapter_1_4083.html
77,78c77,78
< Post-expand include size: 16145/2097152 bytes
< Template argument size: 12139/2097152 bytes
---
> Post-expand include size: 16235/2097152 bytes
> Template argument size: 12151/2097152 bytes
I looked at one of the HTML files to see where these differences appear. They occur in an HTML comment:
<!-- NewPP limit report
Preprocessor node count: 1891/1000000
Post-expand include size: 16145/2097152 bytes
Template argument size: 12139/2097152 bytes
Expensive parser function count: 0/100
-->
Does anyone have an idea of what this is for? Is there any way to configure MW so it isn't produced?
I will post some performance data later.
Dan
On 20/08/2009, at 6:19 PM, dan nessett wrote:
<!-- NewPP limit report
Preprocessor node count: 1891/1000000
Post-expand include size: 16145/2097152 bytes
Template argument size: 12139/2097152 bytes
Expensive parser function count: 0/100
-->
Does anyone have an idea of what this is for? Is there any way to configure MW so it isn't produced?
As the title implies, it is a performance limit report. You can remove it by changing the parser options passed to the parser. Look at the ParserOptions and Parser classes.
-- Andrew Garrett agarrett@wikimedia.org http://werdn.us/
--- On Thu, 8/20/09, Andrew Garrett agarrett@wikimedia.org wrote:
As the title implies, it is a performance limit report. You can remove it by changing the parser options passed to the parser. Look at the ParserOptions and Parser classes.
Thanks. It appears dumpHTML has no command-line option to turn off this report (the parser option is mEnableLimitReport).
A question to the developer community: Is it better to change dumpHTML to accept a new option (to turn off limit reports), or to copy dumpHTML into a new CPRT extension and change that copy? I strongly feel that having two extensions with essentially the same functionality is bad practice. On the other hand, changing dumpHTML means it becomes dual-purposed, which has the potential of making it big and ugly. One compromise is to factor dumpHTML so that a common core provides shared functionality to two different upper layers. However, I don't know whether that is acceptable practice for extensions.
A short-term fix is to pipe the output of dumpHTML through a filter that removes the limit report. That would allow developers to use dumpHTML (as a CPRT) fairly quickly to find and fix the known-to-fail parser bugs. The downside is that this may significantly degrade performance.
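For what it's worth, a minimal sketch of such a filter, assuming the report is emitted as a multi-line HTML comment that opens with "<!-- NewPP limit report" and closes on a line containing "-->" (as in the example above; the sample file here is hypothetical):

```shell
# Hypothetical sample of a dumpHTML page containing the limit report.
printf '%s\n' '<p>body text</p>' \
  '<!-- NewPP limit report' \
  'Preprocessor node count: 1891/1000000' \
  '-->' > page.html

# Delete every line from the opening of the NewPP limit report comment
# through the line containing its closing "-->", leaving the rest intact.
sed '/<!-- NewPP limit report/,/-->/d' page.html
```

Running the stripped output through diff should then show no spurious differences between two dumps of the same database.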
Dan
wikitech-l@lists.wikimedia.org