[cross-posted]
----- Original Message -----
From: "Mark A. Hershberger" mhershberger@wikimedia.org
I suppose these are all linked to the parser work that Brion & co are currently working on, but with the new parser still six months to a year or more away (http://www.mediawiki.org/wiki/Future/Parser_plan ), I'd like to get these sorts of parser issues sorted out now.
My particular hobby horse, the last time wikitext-l was really active and I was involved with it heavily (those two periods are nearly identical, but not quite), was this question, which that wiki page does not seem to address, though the Etherpad might. If not, I still think it's a question fundamental to the implementation of a replacement parser, so I'm going to ask it again so that everyone is thinking about it as work progresses down that path:
How good is good enough?
How many pages is a replacement parser allowed to break, and still be certified?
That is: what is the *real* spec for mediawikitext? If we say "the formal grammar", then we are *guaranteed* to break some articles. That's the "Right Answer", from up here at 40,000 feet, where I watch from (having the luxury of not being responsible in any way for any of this :-), but it will involve breaking some eggs.
I bring this back up because, the last time we had this conversation, the answer was "nope; the new parser will have to be bug-for-bug compatible with the current one". Or something pretty close to that.
I just think this is a question -- and answer -- that people should be slowly internalizing as we proceed down this path.
- Formal Spec
- Multiple Implementations
- Test Suite
I don't think it's completely unreasonable that we might have a way to grind articles against the current parser, and each new parser, and diff the output. Something like that's the only way I can see that we *will* be able to tell how close new parsers come, and on which constructs they break (not that this means that I think The Wikipedia Corpus constitutes a valid Test Suite :-).
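For illustration, a minimal sketch (in Python; the inputs are hypothetical, since none of the parser projects mentioned expose this interface) of what one step of such a grind-and-diff pass might look like:

```python
import difflib
import re

def normalize_html(html: str) -> str:
    """Collapse whitespace so the diff flags real differences, not formatting."""
    return re.sub(r"\s+", " ", html).strip()

def compare_renderings(title: str, old_html: str, new_html: str) -> list[str]:
    """Diff one article's output from the current parser against a candidate's.

    Both HTML strings are split at tag boundaries so the unified diff
    reports roughly one element per line.
    """
    def lines(html: str) -> list[str]:
        return re.sub(r">\s*<", ">\n<", normalize_html(html)).split("\n")
    return list(difflib.unified_diff(
        lines(old_html), lines(new_html),
        fromfile=f"{title} (current parser)",
        tofile=f"{title} (candidate parser)",
        lineterm=""))
```

An empty diff counts the article as unbroken; run over a dump, this would yield exactly the kind of per-construct breakage tally described above.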
Cheers, -- jra
On 9 July 2011 22:54, Jay Ashworth jra@baylink.com wrote:
How good is good enough? How many pages is a replacement parser allowed to break, and still be certified?

That is: what is the *real* spec for mediawikitext? If we say "the formal grammar", then we are *guaranteed* to break some articles. That's the "Right Answer", from up here at 40,000 feet, where I watch from (having the luxury of not being responsible in any way for any of this :-), but it will involve breaking some eggs.

I bring this back up because, the last time we had this conversation, the answer was "nope; the new parser will have to be bug-for-bug compatible with the current one". Or something pretty close to that.
I was thinking the answer was obvious:
Brion is about the only man in the world who can get away with anything less than bug-for-bug compatibility, and have his answer accepted. So it's just as well it's him doing it!
(Well, Tim could too. But Brion is an excellent answer.)
- d.
----- Original Message -----
From: "David Gerard" dgerard@gmail.com
On 9 July 2011 22:54, Jay Ashworth jra@baylink.com wrote:
How good is good enough? How many pages is a replacement parser allowed to break, and still be certified?
I was thinking the answer was obvious:
Brion is about the only man in the world who can get away with anything less than bug-for-bug compatibility, and have his answer accepted. So it's just as well it's him doing it!
(Well, Tim could too. But Brion is an excellent answer.)
My understanding was that there were presently between 4 and 6 independent MWT parser projects going on, and that one of those might end up being what went into the mainline; I didn't know Brion was working on any of them, much less one of his own.
Oops.
Question still stands, though.
Cheers, -- jra
On Sat, Jul 9, 2011 at 2:54 PM, Jay Ashworth jra@baylink.com wrote:
How good is good enough?
How many pages is a replacement parser allowed to break, and still be certified?
That is: what is the *real* spec for mediawikitext? If we say "the formal grammar", then we are *guaranteed* to break some articles. That's the "Right Answer", from up here at 40,000 feet, where I watch from (having the luxury of not being responsible in any way for any of this :-), but it will involve breaking some eggs.
I bring this back up because, the last time we had this conversation, the answer was "nope; the new parser will have to be bug-for-bug compatible with the current one". Or something pretty close to that.
Officially speaking, the spec/new implementation will "win".
The theory is to have a defined, reimplementable, more extensible version of pretty much what we have now, so we can always read and work with the data that we've already got.
It will have to be reasonably close to the old behavior, but we *know* it won't be exact. "Reasonable" will, ultimately, be fairly arbitrary, and will be determined based on our progress and findings some months down the road.
Running big batch comparisons to find visible or semantic differences between the old & new parsers will be an ongoing part of that, so we'll probably have some pretty numbers or graphs to look at as we get closer to something that can be dropped in.
I just think this is a question -- and answer -- that people should be slowly internalizing as we proceed down this path.
- Formal Spec
- Multiple Implementations
- Test Suite
I don't think it's completely unreasonable that we might have a way to grind articles against the current parser, and each new parser, and diff the output. Something like that's the only way I can see that we *will* be able to tell how close new parsers come, and on which constructs they break (not that this means that I think The Wikipedia Corpus constitutes a valid Test Suite :-).
Definitely. :)
Output comparisons can be a tricky business, but they'll be an important component. I've started on a CLI batch-testing framework for the JavaScript parser class (using node.js; in ParserPlayground/tests) that can read through a Wikipedia XML dump and run round-tripping checks; moving on to rendering HTML output and comparing it against the current MediaWiki parser's output will be very valuable (though doing comparisons on HTML is tricky!)
-- brion
On 17/07/11 19:16, Brion Vibber wrote:
Output comparisons can be a tricky business, but they'll be an important component. I've started on a CLI batch-testing framework for the JavaScript parser class (using node.js; in ParserPlayground/tests) that can read through a Wikipedia XML dump and run round-tripping checks; moving on to rendering HTML output and comparing it against the current MediaWiki parser's output will be very valuable (though doing comparisons on HTML is tricky!)
-- brion
There's a rough script for comparing parsers in the maintenance folder, but with one parser in PHP and the other in JavaScript, it can be hard to do well. Spawning a process for each article would be slow... Perhaps using the SpiderMonkey PECL extension [1]?