Hi,
I'm going on holiday for the next week, so I will not be able to work on the lex/yacc parser that I have been writing over the past few weeks. I will check my work so far into CVS, and anyone interested can continue the work while I am away.
So far, the parser can do:
* paragraphs
* pre-lines (lines beginning with spaces)
* lists (* and # only)
* extensions (<math>, <hiero>)
* headings
* bold and italics
I am sorry I took so long to do bold and italics, but, just as I originally anticipated, it was quite hard. I discarded two failed attempts before the third one finally worked out. There is one special case in which I had to apply a bit of a hack, but I am sure that this is okay, given that it works pretty much perfectly now.
As for "extensions", the parser currently treats as an extension anything enclosed between an HTML tag without attributes and its corresponding closing tag. Using this mechanism, <nowiki> and <pre> can be considered "extensions" for the purposes of the parser.
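To illustrate the rule described above, here is a rough Python sketch (not the actual lex/yacc code). The regular expression and the function name split_extensions are made up for this example; the only thing taken from the parser is the idea that a tag without attributes, together with its matching closing tag, delimits an extension.

```python
import re

# Hypothetical pattern, for illustration only: an opening tag with no
# attributes (e.g. <math>), the shortest possible body, and the closing
# tag with the same name (e.g. </math>).
EXTENSION_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)>(.*?)</\1>", re.DOTALL)

def split_extensions(text):
    """Split text into ("text", s) and ("extension", name, body) chunks."""
    chunks = []
    pos = 0
    for m in EXTENSION_RE.finditer(text):
        if m.start() > pos:
            # Ordinary wikitext between two extensions.
            chunks.append(("text", text[pos:m.start()]))
        # The body is kept as-is and not parsed any further.
        chunks.append(("extension", m.group(1), m.group(2)))
        pos = m.end()
    if pos < len(text):
        chunks.append(("text", text[pos:]))
    return chunks
```

With this sketch, a tag that carries attributes, such as <div class="x">, is not matched and stays ordinary text, while any attribute-less pair such as <nowiki>...</nowiki> is picked up as an extension.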
What is missing:
* links, images, categories (everything in [[ ... ]])
* template inclusions and variables ({{...}} and {{{...}}})
* tables
* HTML tags that should be allowed but are not extensions (esp. div)
The lexer already recognises tokens for the first two of these, but not for tables or HTML tags. In particular, it will treat something like <b>''something''</b> as an "extension" and will not parse the '' as italics. Obviously, this needs to be fixed.
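One possible direction for a fix, sketched in Python rather than lex/yacc, is to keep two separate lists of tag names. This is only an assumption about how it could be done, not what the parser does today. The names EXTENSION_TAGS, ALLOWED_HTML_TAGS and classify_tag are hypothetical, and the sets contain only the tags mentioned in this mail.

```python
# Assumed tag lists, for illustration only; the real lists would need to
# be decided separately.
EXTENSION_TAGS = {"math", "hiero", "nowiki", "pre"}
ALLOWED_HTML_TAGS = {"b", "div"}

def classify_tag(tag_name):
    """Decide how the text between <tag_name> and </tag_name> is handled.

    "extension": the body is passed through without parsing wiki markup.
    "html":      the tag is kept, and the body is parsed as wikitext.
    "text":      the tag is unknown and is treated as ordinary text.
    """
    name = tag_name.lower()
    if name in EXTENSION_TAGS:
        return "extension"
    if name in ALLOWED_HTML_TAGS:
        return "html"
    return "text"
```

Under this scheme, <b>''something''</b> would be classified as "html", so the '' inside it would still reach the italics rules.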
If anything is unclear about how things work, please drop me an e-mail and I will document the relevant bits when I am back.
Timwi