I am investigating how to write a comprehensive parser regression test. What I mean by this is something you wouldn't normally run frequently, but rather something we could use to get past the "known to fail" tests that are now disabled. The problem is that no one understands the parser well enough to have confidence that fixing one of these tests will not break something else.
So, I thought, how about using the guts of DumpHTML to create a comprehensive parser regression test? The idea is to have two versions of phase3 + extensions: one without the change you make to the parser to fix a known-to-fail test (call this Base) and one with the change (call this Current). Modify DumpHTML to first render a page through Base, saving the HTML, then render the same page through Current and compare the two results. Do this for every page in the database. If there are no differences, the change in Current works.
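In rough code, the inner loop would be something like this (just a sketch; every function name here is a placeholder, not something that exists in DumpHTML today):

    // Render every page through both installs and record any mismatch.
    // getAllPageTitles(), renderWithBase(), renderWithCurrent() and
    // logDifference() are all hypothetical helpers.
    foreach ( getAllPageTitles() as $title ) {
        $baseHtml    = renderWithBase( $title );    // phase3 without the fix
        $currentHtml = renderWithCurrent( $title ); // phase3 with the fix
        if ( $baseHtml !== $currentHtml ) {
            logDifference( $title, $baseHtml, $currentHtml );
        }
    }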
Sitting here I can see the eyeballs of various developers bulging from their faces. "What?" they say. "If you ran this test on, for example, Wikipedia, it could take days to complete." Well, that is one of the things I want to find out. The key to making this test useful is getting the code in the loop (rendering the page twice and testing the results for equality) very efficient. I may not have the skills to do this, but I can at least develop an upper bound on the time it would take to run such a test.
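To get a first upper bound, one could time a single dual render and extrapolate. For instance (the numbers are placeholders, not measurements):

    // Back-of-envelope: if enwiki has roughly 3,000,000 pages and each
    // render takes, say, 0.5 s, that's 3e6 * 2 * 0.5 s = 3e6 s, or about
    // 35 days on a single core -- hence the need to make the loop fast
    // and to parallelize it.
    $start = microtime( true );
    renderWithBase( $title );     // hypothetical helpers from the sketch above
    renderWithCurrent( $title );
    $perPage = microtime( true ) - $start;
    echo 'Estimated total: ' . ( $perPage * 3000000 / 86400 ) . " days\n";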
A comprehensive parser regression test would be valuable for:
* fixing the known-to-fail tests.
* testing any new parser that some courageous developer decides to code.
* testing major releases before they are released.
* catching bugs that aren't found by the current parserTest tests.
* other things I haven't thought of.
Of course, you wouldn't run this thing nightly or, perhaps, even weekly. Maybe once a month would be enough to ensure the parser hasn't regressed out of sight.
--- On Wed, 8/12/09, dan nessett dnessett@yahoo.com wrote:
"If you ran this test on, for example, Wikipedia,
Of course, what I meant is to run the test on the Wikipedia database, not on the live system.
Dan
2009/8/12 dan nessett dnessett@yahoo.com:
[...]
So, I thought, how about using the guts of DumpHTML to create a comprehensive parser regression test? The idea is to have two versions of phase3 + extensions: one without the change you make to the parser to fix a known-to-fail test (call this Base) and one with the change (call this Current). Modify DumpHTML to first render a page through Base, saving the HTML, then render the same page through Current and compare the two results. Do this for every page in the database. If there are no differences, the change in Current works.
Sitting here I can see the eyeballs of various developers bulging from their faces. "What?" they say. "If you ran this test on, for example, Wikipedia, it could take days to complete." Well, that is one of the things I want to find out. The key to making this test useful is getting the code in the loop (rendering the page twice and testing the results for equality) very efficient. I may not have the skills to do this, but I can at least develop an upper bound on the time it would take to run such a test.
I read this paragraph first, then read the paragraph above and couldn't help saying "WHAT?!?". Using a huge set of pages is a poor replacement for decent tests. Also, how would you handle intentional changes to the parser output, especially when they're non-trivial?
Roan Kattouw (Catrope)
--- On Wed, 8/12/09, Roan Kattouw roan.kattouw@gmail.com wrote:
I read this paragraph first, then read the paragraph above and couldn't help saying "WHAT?!?". Using a huge set of pages is a poor replacement for decent tests.
I am not proposing that the CPRT be a substitute for "decent tests." We still need a good set of tests for the whole MW product (not just the parser). Nor would I recommend making a change to the parser and then immediately running the CPRT. Any developer who isn't masochistic would first run the existing parserTests and make sure they pass. Then you would probably want to run the modified DumpHTML against a small random selection of pages in the WP DB. Only if it passes those tests would you then run the CPRT for final assurance.
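For the random-selection step, something along these lines might do (a sketch against MediaWiki's database API; page_random is the column Special:Random uses, but comparePage() is a hypothetical CPRT helper):

    // Pick ~1000 pages uniformly at random and run the comparison on them.
    $dbr = wfGetDB( DB_SLAVE );
    $res = $dbr->select(
        'page',
        array( 'page_id', 'page_namespace', 'page_title' ),
        array(),                      // no conditions: consider all pages
        __METHOD__,
        array( 'ORDER BY' => 'page_random', 'LIMIT' => 1000 )
    );
    foreach ( $res as $row ) {
        comparePage( $row->page_id ); // hypothetical: render twice and diff
    }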
The CPRT I am proposing is about as good a test of the parser as I can think of. If a change to the parser passes it using the Wikipedia database (currently 5 GB), then I would say that, for all practical purposes, the change does not regress the parser.
Also, how would you handle intentional changes to the parser output, especially when they're non-trivial?
I don't understand this point. Would you elaborate?
Dan
On Wed, Aug 12, 2009 at 4:48 PM, dan nessett dnessett@yahoo.com wrote:
[...]
Also, how would you handle intentional changes to the parser output, especially when they're non-trivial?
I don't understand this point. Would you elaborate?
Dan
To elaborate on the final point. Sometimes the parser is changed and it breaks output on purpose. Case in point was when Tim rewrote the preprocessor. Some parts of syntax were intentionally changed. You'd have to establish a new baseline for this new behavior at that point.
This also comes down to the fact that we don't have a formal grammar for wikisyntax (basically it's whatever the Parser says it is at any given time). This makes testing the parser hard -- we can only give it input and expected output; there's no standard to check against.
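For reference, that input/expected-output pairing is exactly what a case in parserTests.txt looks like, roughly:

    !! test
    Simple bold markup
    !! input
    '''bold'''
    !! result
    <p><b>bold</b>
    </p>
    !! end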
Finally, I don't think we need to dump all of enwiki. It can't require that much content to describe the various combinations of wiki syntax...
-Chad
2009/8/12 Chad innocentkiller@gmail.com:
To elaborate on the final point. Sometimes the parser is changed and it breaks output on purpose. Case in point was when Tim rewrote the preprocessor. Some parts of syntax were intentionally changed. You'd have to establish a new baseline for this new behavior at that point.
Another (but fundamentally different) case in point is when Aryeh removed type="" on <script> tags or when I added class="tocsection-n" (n being the section number) to the TOC <li>s. In some cases you want to add/remove an attribute or otherwise intentionally change the HTML output, and if that output is covered by parser tests those tests need to be updated. When updating parser test cases it's easy enough to tell the difference between intentional and accidental changes, but this is not so apparent when comparing pre- and post-change parses of big wiki pages (like when Tim added class="mw-redirect" to all links to redirects, which cluttered Special:ParserDiffTest output).
Finally, I don't think we need to dump all of enwiki. It can't require that much content to describe the various combinations of wiki syntax...
Exactly. Instead of throwing a huge amount of wikitext at it and hoping that'll cover everything, we should make our test suite more comprehensive by adding lots of new parser tests. Of course there'll be *some* crazy bugs concerning weird interactions that only happen in the wild; you can't really write tests for those, but that's life.
Roan Kattouw (Catrope)
On 8/12/09 4:05 PM, Roan Kattouw wrote:
Exactly. Instead of throwing a huge amount of wikitext at it and hoping that'll cover everything, we should make our test suite more comprehensive by adding lots of new parser tests. Of course there'll be *some* crazy bugs concerning weird interactions that only happen in the wild; you can't really write tests for those, but that's life.
Real-page testing is like fuzz testing; the point is never to replace explicit tests, but that it helps you find things you didn't think of yourself.
-- brion
On 8/12/09 2:55 PM, Chad wrote:
To elaborate on the final point. Sometimes the parser is changed and it breaks output on purpose. Case in point was when Tim rewrote the preprocessor. Some parts of syntax were intentionally changed. You'd have to establish a new baseline for this new behavior at that point.
Knowing when things change is important -- you want to know that things changed *on purpose*, not *by accident*.
And yes, that means that baseline shifts can be painful as we have to go through and see if any of the changes were regressions, but that's the price of knowing what's going on. ;)
-- brion
Chad wrote:
To elaborate on the final point. Sometimes the parser is changed and it breaks output on purpose. Case in point was when Tim rewrote the preprocessor. Some parts of syntax were intentionally changed. You'd have to establish a new baseline for this new behavior at that point.
This also comes down to the fact that we don't have a formal grammar for wikisyntax (basically it's whatever the Parser says it is at any given time). This makes testing the parser hard -- we can only give it input and expected output; there's no standard to check against.
Finally, I don't think we need to dump all of enwiki. It can't require that much content to describe the various combinations of wiki syntax...
In principle, I rather like the idea of using the entire English Wikipedia (or why limit to that? we have plenty of other projects too) as a parser test, or at least of having the ability to do that if we want.
You see, the flip side to not having a formal grammar for wikimarkup is that we also don't have a spec sheet for it: the best description of how people actually expect the parser to behave and what features they expect it to support is what they're actually using it for on their wikis. And en.wikipedia is the biggest and ugliest of the bunch.
There's no way we can ever write a test suite comprehensive enough to cover every single feature, bug, quirk and coincidence that actual wiki pages and templates may have come to rely on. That's simply because for every MediaWiki coder there are dozens or hundreds of template writers and thousands of other editors.
In a way, all those editors form the biggest, most thorough fuzz tester there can be. The only problem is that it's also a rather inefficient one, even for a fuzz tester: most wiki pages exercise only a fairly small and boring set of parser features. But at least, if one were to, say, run a random sample of a few thousand Wikipedia pages through the parser and observe no unexpected changes in the output, one could start to make some statistical predictions about how many of the remaining pages one could at worst expect to break.
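To make that concrete, the standard "rule of three" applies: if a uniform random sample of n pages shows zero differences, the 95% upper confidence bound on the per-page breakage rate is 1 - 0.05^(1/n), which is roughly 3/n. A sketch:

    // 3000 clean sample pages => bound of ~0.001, i.e. at worst on the
    // order of 3000 affected pages on a 3,000,000-page wiki, at 95%
    // confidence. (Assumes pages are sampled uniformly and independently.)
    function breakageUpperBound( $n, $confidence = 0.95 ) {
        return 1 - pow( 1 - $confidence, 1 / $n );
    }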
The real problem, as noted elsewhere in the thread, is of course filtering the unexpected changes from any expected ones. A partial solution could be having the test implementation extract the changes -- we conveniently have a word-level diff implementation available already -- and combine any duplicates.
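Combining duplicates could be as simple as counting identical hunks (a sketch; assumes $diffHunks holds one hunk string per change, however the diff tool produces them):

    // Collapse identical diff hunks across pages, so one intentional
    // change (e.g. a new class on every link) shows up once with a count
    // instead of a million times.
    $counts = array();
    foreach ( $diffHunks as $hunk ) {
        $key = trim( $hunk );
        $counts[$key] = isset( $counts[$key] ) ? $counts[$key] + 1 : 1;
    }
    arsort( $counts ); // most frequent -- and most likely intentional -- first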
Another, complementary approach would be to allow the person running the tests to postprocess the two outputs before they're compared, so as to try and eliminate any expected differences. Of course, this would require some significant extra effort on the part of that person, beyond just typing "php runSomeTests.php" and hitting enter, but then again, thoroughly analyzing the effects of a major parser change is a nontrivial exercise anyway, no matter what. And for things that _shouldn't_ cause any changes to the parser output, it really could be just as easy, in principle at least, as running parserTests currently is.
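For example, masking a known-intentional change like the mw-redirect one could look like this (the regex is purely illustrative, not a robust HTML transform):

    // Strip the attribute we know was added on purpose from both outputs,
    // so that only unexpected differences survive the comparison.
    function normalize( $html ) {
        return preg_replace( '/ class="mw-redirect"/', '', $html );
    }
    $baseHtml    = normalize( $baseHtml );
    $currentHtml = normalize( $currentHtml );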
dan nessett dnessett@yahoo.com wrote:
[...]
So, I thought, how about using the guts of DumpHTML to create a comprehensive parser regression test? The idea is to have two versions of phase3 + extensions: one without the change you make to the parser to fix a known-to-fail test (call this Base) and one with the change (call this Current). Modify DumpHTML to first render a page through Base, saving the HTML, then render the same page through Current and compare the two results. Do this for every page in the database. If there are no differences, the change in Current works. [...]
I use a similar approach on a toolserver script in addition to smaller tests: I saved several revisions of "interesting" wiki pages and the respective output of the then-current script version to the subversion repository. Before committing changes, I run a test to check whether the current script produces the same results. If the results are different, either a bug needs to be fixed or the expected output needs to be amended (in the same commit).
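In outline the check is no more complicated than this (a generic sketch, not the actual script; runScript() stands in for the code under test):

    // Re-run the script on each saved input and compare against the
    // stored expected output; report every mismatch.
    foreach ( glob( 'tests/*.input' ) as $inputFile ) {
        $expectedFile = preg_replace( '/\.input$/', '.expected', $inputFile );
        $expected = file_get_contents( $expectedFile );
        $actual = runScript( file_get_contents( $inputFile ) );
        if ( $actual !== $expected ) {
            echo "MISMATCH: $inputFile\n";
        }
    }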
Tim
P. S.: I find your dedication to QA very laudable; I think though that more people would read and embrace your thoughts if you would find a more concise way to put them across :-).
--- On Wed, 8/12/09, Tim Landscheidt tim@tim-landscheidt.de wrote:
I think though that more people would read and embrace your thoughts if you would find a more concise way to put them across :-).
Mea Culpa. I'll shut up for a while.
dan nessett dnessett@yahoo.com wrote:
I think though that more people would read and embrace your thoughts if you would find a more concise way to put them across :-).
Mea Culpa. I'll shut up for a while.
That's the right form with the wrong message :-).
Tim