On Thu, 08 Nov 2007 20:26:33 -0500, Simetrical wrote:
On 11/8/07, Steve Sanbeg <ssanbeg@ask.com> wrote:
I think that's true; if you tokenize correctly, that would go a long way. Unfortunately, there are a few constructs that make tokenization tricky. The apostrophe is the most obvious case, but {'s and, to a lesser extent, ['s could have similar problems, since they would require substantial lookahead to tokenize.
According to the flex documentation, it's perfectly happy to accept any regex for tokens, and will use unlimited lookahead and backtracking if necessary. It provides debug info that lets you check for and eliminate backtracking if you want to speed it up, but that's optional. Clearly it's not possible to tokenize MW markup with one-character lookahead: with only one character you sure can't tell the difference between a second- and a sixth-level heading, and that's even ignoring stuff like ISBN handling that's less basic and more disposable.
But some constructs in MW need a tokenizer that carries state between tokens, not just a regex per token. Properly tokenizing bold/italics requires complex processing on an entire paragraph of text. Even templates and links are a little complex, but should be doable by maintaining state with a stack.
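
As a rough illustration of the "state with a stack" idea for templates and links, here is a minimal Python sketch. It is not how MediaWiki's parser or a flex scanner actually works; the names TOKEN_RE, CLOSER and tokenize are made up for this example, and it assumes only {{...}} and [[...]] need tracking (no {{{parameters}}}, no apostrophes, no headings).

```python
import re

# Hypothetical token pattern: double delimiters first, then runs of plain
# text, then any stray single brace or bracket.
TOKEN_RE = re.compile(r"\{\{|\}\}|\[\[|\]\]|[^{}\[\]]+|[{}\[\]]")

# Which closing delimiter matches each opening one.
CLOSER = {"{{": "}}", "[[": "]]"}


def tokenize(text):
    """Return (kind, value, depth) tuples, tracking nesting with a stack."""
    stack = []   # currently open delimiters, innermost last
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        if tok in CLOSER:
            # An opener always pushes a new nesting level.
            stack.append(tok)
            tokens.append(("OPEN", tok, len(stack)))
        elif stack and tok == CLOSER[stack[-1]]:
            # A closer only counts if it matches the innermost opener.
            tokens.append(("CLOSE", tok, len(stack)))
            stack.pop()
        else:
            # Everything else, including unmatched closers, is plain text.
            tokens.append(("TEXT", tok, len(stack)))
    return tokens


if __name__ == "__main__":
    for token in tokenize("a {{b|[[c]]}} d"):
        print(token)
```

The point of the stack is that the same "}}" is a CLOSE token inside a template and plain TEXT outside one, which a context-free per-token regex cannot decide on its own. Bold/italics would still need the separate paragraph-level pass mentioned above.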