On Tue, Oct 14, 2003 at 11:16:19AM +1300, Richard Grevers wrote:
On Mon, 13 Oct 2003 17:21:15 -0400, David Friedland david@nohat.net gave utterance to the following:
There seems to be a lot of disjoint discussion on Meta about this. Viz:
- There is work that has been done by Taw on an OCaml lexer at http://meta.wikipedia.org/wiki/Wikipedia_lexer
My suggestions would be "the broken wikitext language", or the "invalid wikitext language". Because of its UseMod ancestry, the current parser produces some very bad HTML code*, and in particular handles lists and nesting of blocks really badly.
* Not so bad if HTML 3.2 or 4 is our target, but it would be nice to be able to produce clean XHTML.
A few months back I started work on a ValidWiki parser, which has a much stronger concept of block and line elements, and uses both block and line stacks to open and close all elements correctly. I think I'm about 2/3 of the way through the block parser, and haven't yet written the line parser. I have no idea how the code would compare for efficiency. Unfortunately the only language I know how to code in is MivaScript, so it would need porting. (Miva performs okay for your mid-level merchant application, but doesn't have the efficiency for something with the workload of Wikipedia.)
Uhm, my parser has a block stack + line stack architecture too. But the sources at http://meta.wikipedia.org/wiki/Wikipedia_lexer aren't the most recent.
Newer sources attached.
It's not complete but it wasn't really meant to be. It was meant to be a proof of concept that a mix of wiki markup and HTML can be parsed in an XHTML-correct and DWIM way extremely efficiently. Concept proven, but integrating the parser with the rest of Wikipedia would take much more time than I'm willing to spend right now.
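For anyone unfamiliar with the block stack + line stack idea mentioned above, here is a rough OCaml sketch of it. This is not the attached code and not the ValidWiki parser; every name in it (split_prefix, set_blocks, toggle, inline, render) is made up for illustration. It assumes only "*" and "#" list prefixes and ''/''' emphasis, and it escapes raw HTML instead of accepting it, which is a deliberate simplification.

```ocaml
(* Illustrative sketch only: a block stack tracks open lists, a line stack
   tracks open inline tags. All names here are hypothetical. *)

(* Split a line into its leading list markers and the remaining text. *)
let split_prefix line =
  let n = String.length line in
  let i = ref 0 in
  while !i < n && (line.[!i] = '*' || line.[!i] = '#') do incr i done;
  (String.sub line 0 !i, String.sub line !i (n - !i))

let tag_of = function '*' -> "ul" | _ -> "ol"

let rec drop n l =
  if n <= 0 then l else match l with [] -> [] | _ :: t -> drop (n - 1) t

(* Block stack, innermost list first. Close lists the new prefix no longer
   shares, then open the ones it adds, always nesting inside an open li. *)
let set_blocks out stack prefix =
  let want = List.init (String.length prefix) (String.get prefix) in
  let rec common a b = match a, b with
    | x :: a', y :: b' when x = y -> 1 + common a' b'
    | _ -> 0 in
  let keep = common (List.rev stack) want in
  let rec close stack n =
    if n = 0 then stack
    else match stack with
      | c :: rest ->
          Buffer.add_string out ("</li></" ^ tag_of c ^ ">");
          close rest (n - 1)
      | [] -> [] in
  let stack = close stack (List.length stack - keep) in
  let fresh = drop keep want in
  (* Same list as before and nothing new to open: start the next item. *)
  if fresh = [] && keep > 0 then Buffer.add_string out "</li><li>";
  List.fold_left
    (fun st c -> Buffer.add_string out ("<" ^ tag_of c ^ "><li>"); c :: st)
    stack fresh

(* Line stack, innermost tag first. Closing a tag that is not on top closes
   the tags above it first and reopens them afterwards. *)
let toggle out stack tag =
  if List.mem tag stack then begin
    let rec unwind reopened = function
      | t :: rest when t <> tag ->
          Buffer.add_string out ("</" ^ t ^ ">");
          unwind (t :: reopened) rest
      | _ :: rest ->
          Buffer.add_string out ("</" ^ tag ^ ">");
          List.fold_left
            (fun st t -> Buffer.add_string out ("<" ^ t ^ ">"); t :: st)
            rest reopened
      | [] -> [] in
    unwind [] stack
  end else begin
    Buffer.add_string out ("<" ^ tag ^ ">");
    tag :: stack
  end

(* Scan one line of text, then close anything still open at end of line. *)
let inline out text =
  let n = String.length text in
  let rec go i stack =
    if i >= n then stack
    else if i + 3 <= n && String.sub text i 3 = "'''" then
      go (i + 3) (toggle out stack "strong")
    else if i + 2 <= n && String.sub text i 2 = "''" then
      go (i + 2) (toggle out stack "em")
    else begin
      (match text.[i] with
       | '<' -> Buffer.add_string out "&lt;"
       | '>' -> Buffer.add_string out "&gt;"
       | '&' -> Buffer.add_string out "&amp;"
       | c -> Buffer.add_char out c);
      go (i + 1) stack
    end in
  List.iter (fun t -> Buffer.add_string out ("</" ^ t ^ ">")) (go 0 [])

let render lines =
  let out = Buffer.create 256 in
  let stack =
    List.fold_left (fun stack line ->
      let prefix, rest = split_prefix line in
      let stack = set_blocks out stack prefix in
      let rest = String.trim rest in
      if prefix = "" then begin
        if rest <> "" then begin
          Buffer.add_string out "<p>";
          inline out rest;
          Buffer.add_string out "</p>"
        end
      end else inline out rest;
      stack) [] lines in
  ignore (set_blocks out stack "");
  Buffer.contents out
```

With this sketch, render ["* a"; "** ''b''"; "* c"; "text"] gives <ul><li>a<ul><li><em>b</em></li></ul></li><li>c</li></ul><p>text</p>, so the nested list sits inside its parent li and every element is closed. Overlapping emphasis such as '''bold ''both''' it is unwound and reopened rather than emitted as crossed tags.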