On 11/8/07, Steve Bennett stevagewp@gmail.com wrote:
I think it would be a good idea to formalise and improve the grammar so that wasn't the case. Does any sane grammar need more than one token of lookahead?
Certainly, unless your definition of "sane" is very narrow. I believe neither C++ nor Perl has an LALR(1) grammar. I once saw at least one syntax suggestion for Python that was rejected because it required multi-token lookahead.
On 11/8/07, Steve Bennett stevagewp@gmail.com wrote:
On 11/9/07, Simetrical Simetrical+wikilist@gmail.com wrote:
Clearly it's not possible to tokenize MW markup with one-character lookahead: you sure can't tell the difference between a second- and sixth-level heading, and of course that's even ignoring
Yes you can, if ====== is a token. Which at first glance, it should be. The fact that == looks like === looks like ==== is neither here nor there to the grammar - it's a handy mnemonic for humans, that's all.
The point is that if those are different token types (rather than the same token type -- which I guess they could be), you can't tokenize them without lookahead. Or at least I don't think you can: maybe I misunderstand lookahead. I guess it doesn't require backtracking, regardless.
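To make the heading case concrete, here is a minimal Python sketch of scanning a run of equals signs as a single token. The names (scan_heading_marker, HEADING_MARK) and the cap at six levels are assumptions for illustration, not anything taken from the MediaWiki parser. The scanner has to keep reading until it sees a character that is not "=", so it looks ahead further than one character, but it never has to go back and undo anything.

```python
def scan_heading_marker(text, pos=0):
    """Scan a run of '=' starting at pos.

    Returns ((token_type, level), next_pos), or (None, pos) if there is
    no '=' at pos. Hypothetical helper, for illustration only.
    """
    end = pos
    # Keep consuming '=' until the run ends: this is the lookahead.
    while end < len(text) and text[end] == "=":
        end += 1
    run = end - pos
    if run == 0:
        return None, pos
    # Assumption: at most six heading levels; extra '=' are left in the input.
    level = min(run, 6)
    return ("HEADING_MARK", level), pos + level


print(scan_heading_marker("== Title =="))
print(scan_heading_marker("====== Deep ======"))
```

Whether the level is an attribute of one token type (as here) or each run length is its own token type makes no difference to the scanning loop.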
Certainly apostrophes require more than one character of lookahead, and backtracking.
What's wrong with ISBN handling? I don't see anything problematic in an "ISBN" token that consumes a following sequence of digits, possibly with hyphens and crap.
ISBN 123456789X is parsed as an ISBN. ISBN 123456789 is not, because it doesn't have enough digits. That means you need quite a lot of lookahead and backtracking for ISBNs, at least in the tokenizer. That was my point: the tokenizer will need to be able to backtrack. Judging by the flex docs, I don't think it's a big issue. The reason for my post was to respond to Steve Sanbeg's remark about how much lookahead the tokenizer needs.
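A rough Python sketch of the ISBN case, to show where the backtracking comes in. The function name scan_isbn and the exact rule (ten characters, the last optionally an X, hyphens ignored) are assumptions based only on the two examples above; the real rule in the parser may well differ, for instance in how it treats spaces or longer numbers.

```python
DIGITS_AND_HYPHEN = "0123456789-"


def scan_isbn(text, pos=0):
    """Try to scan 'ISBN <number>' at pos.

    Returns (("ISBN", number), next_pos) on success, or (None, pos) if
    the candidate is not a ten-character ISBN. Hypothetical helper.
    """
    if not text.startswith("ISBN ", pos):
        return None, pos
    i = pos + len("ISBN ")
    chars = []
    # Speculatively consume digits and hyphens; we cannot know yet
    # whether this will turn out to be a valid ISBN.
    while i < len(text) and text[i] in DIGITS_AND_HYPHEN:
        if text[i] != "-":
            chars.append(text[i])
        i += 1
    # Assumption: an X is only allowed as the tenth (check) character.
    if len(chars) == 9 and i < len(text) and text[i] in "Xx":
        chars.append("X")
        i += 1
    if len(chars) == 10:
        return ("ISBN", "".join(chars)), i
    # Not enough (or too many) digits: backtrack to where we started,
    # so the caller can tokenize "ISBN" as ordinary text instead.
    return None, pos


print(scan_isbn("ISBN 123456789X"))
print(scan_isbn("ISBN 123456789"))
```

The failing case is the interesting one: the scanner has already read the whole digit string before it finds out the token is invalid, and has to hand all of that input back.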