On Fri, 09 Nov 2007 15:24:10 +1100, Steve Bennett wrote:
On 11/9/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
According to flex documentation, it's perfectly happy to accept any regex for tokens, and will use unlimited lookahead and backtracking if necessary. It provides debug info allowing you to check for and eliminate backtracking, if you want to speed it up, but that's optional. Clearly it's not possible to tokenize MW markup with one-character lookahead: you sure can't tell the difference between a second- and sixth-level heading, and of course that's even ignoring
Yes you can, if ====== is a token. Which at first glance, it should be. The fact that == looks like === looks like ==== is neither here nor there to the grammar - it's a handy mnemonic for humans, that's all.
Well, that's exactly the point. At first glance, === is obviously a token, which will perfectly handle 99% of the headings out there. But if we want a complete grammar, we really need sane handling for the last 1%.
Getting those into one token would require the tokenizer to do a bit of parsing to match things up. If instead the tokenizer just recognizes a run of "=" as a token and passes its length to the parser as a value, the parser can deal with matching up the values, and that would probably be a cleaner implementation.
I'm not sure if there's a notation for values in EBNF, so to invent one for this example, treating
===head==
as "==" TEXT("=head") "==" would be nice, but tricky. Treating it as "="(3) TEXT("head") "="(2) would make for a cleaner lexer, and the parser should be able to handle that without too much trouble.
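
A rough sketch of that split, in Python purely for illustration. The names lex and parse_heading and the EQ/TEXT token kinds are made up for this example, and the rule that the heading level is the shorter of the two "=" runs (with the surplus folded back into the text) is my assumption about what sane handling of the odd cases would look like, not something taken from the MediaWiki source.

```python
import re
from typing import List, Optional, Tuple

Token = Tuple[str, object]  # (kind, value): ("EQ", run_length) or ("TEXT", string)


def lex(line: str) -> List[Token]:
    """Split a line into runs of '=' and runs of anything else.

    The lexer does no matching at all; it only reports how long each
    '=' run is, i.e. the "="(3) notation invented above.
    """
    tokens: List[Token] = []
    for m in re.finditer(r"=+|[^=]+", line):
        s = m.group()
        if s[0] == "=":
            tokens.append(("EQ", len(s)))
        else:
            tokens.append(("TEXT", s))
    return tokens


def parse_heading(tokens: List[Token]) -> Optional[Tuple[int, str]]:
    """Return (level, title) if the token list looks like a heading, else None.

    Assumed rule: level = min(opening run, closing run); any extra '='
    on the longer side becomes part of the title text.
    """
    if len(tokens) < 3 or tokens[0][0] != "EQ" or tokens[-1][0] != "EQ":
        return None
    open_n, close_n = tokens[0][1], tokens[-1][1]
    level = min(open_n, close_n)
    # Interior '=' runs are ordinary text once we know they are not delimiters.
    inner = "".join("=" * v if k == "EQ" else v for k, v in tokens[1:-1])
    title = "=" * (open_n - level) + inner + "=" * (close_n - level)
    return level, title
```

With this, lex("===head==") gives the "="(3) TEXT("head") "="(2) token stream, and parse_heading turns it into a level-2 heading whose text is "=head", which is the reading I called nice but tricky to get out of the lexer alone.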