Quoting Erik Moeller erik_moeller@gmx.de:
What I don't get is why you want to parse the raw wikitext. This will be quite a PITA unless you also bundle PHP, texvc etc. It would be much easier to use the existing parser code to create a static HTML dump from the wikisource, and use something like swish-E ( www.swish-e.org ) to index that HTML dump. This would give you more time to focus on what actually matters, i.e. the user interface.
This seems like a pretty hackish solution long-term. The HTML dump has some semantic information, but it also has a lot of HTML-ish cruft in it. The wikitext doesn't have all the semantic information anyone might want, but it's much better than the HTML version. If you're going to do anything reasonably intelligent with the output (other than just display the rendered HTML), or output it to some different format (TeX is the one I've been working on), and want it automated, a lot of that information will be useful.
So, basically: Wikitext --> abstract syntax --> a presentation format (HTML, TeX, etc.)
instead of: Wikitext --> one presentation format (HTML) --> another presentation format
...seems better to me.
The latter version is sort of like compiling a C++ program into x86 assembly and then transforming it into PowerPC assembly from that, rather than doing what gcc does--compiling C++ into an abstract intermediate representation, which can then be output to x86 assembly or PowerPC assembly or whatever you might like.
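To make the first pipeline concrete, here is a rough Python sketch of the idea, not the actual MediaWiki parser or my TeX converter. It handles only a toy subset of wikitext ('''bold''', ''italic'' and <math> tags, with no nesting), and every name in it (Text, Bold, Italic, Math, parse, to_html, to_tex) is made up for illustration. The HTML form chosen for math, a span with class "math", is likewise just an assumption.

```python
import html
import re
from dataclasses import dataclass

# Hypothetical node types for a tiny "abstract syntax" of wikitext.
@dataclass
class Text:
    value: str

@dataclass
class Bold:
    value: str

@dataclass
class Italic:
    value: str

@dataclass
class Math:
    value: str

# Order matters: ''' (bold) must be tried before '' (italic).
TOKEN = re.compile(
    r"'''(?P<bold>.+?)'''"
    r"|''(?P<italic>.+?)''"
    r"|<math>(?P<math>.+?)</math>"
)

def parse(wikitext):
    """Wikitext --> flat list of nodes (toy subset, no nesting)."""
    nodes = []
    pos = 0
    for m in TOKEN.finditer(wikitext):
        if m.start() > pos:
            nodes.append(Text(wikitext[pos:m.start()]))
        if m.group("bold") is not None:
            nodes.append(Bold(m.group("bold")))
        elif m.group("italic") is not None:
            nodes.append(Italic(m.group("italic")))
        else:
            nodes.append(Math(m.group("math")))
        pos = m.end()
    if pos < len(wikitext):
        nodes.append(Text(wikitext[pos:]))
    return nodes

def to_html(nodes):
    """Abstract syntax --> HTML (one possible presentation format)."""
    out = []
    for n in nodes:
        body = html.escape(n.value)
        if isinstance(n, Bold):
            out.append("<b>" + body + "</b>")
        elif isinstance(n, Italic):
            out.append("<i>" + body + "</i>")
        elif isinstance(n, Math):
            out.append('<span class="math">' + body + "</span>")
        else:
            out.append(body)
    return "".join(out)

def to_tex(nodes):
    """Abstract syntax --> TeX. Special characters in plain text are
    not escaped here, which a real converter would have to do."""
    out = []
    for n in nodes:
        if isinstance(n, Bold):
            out.append("\\textbf{" + n.value + "}")
        elif isinstance(n, Italic):
            out.append("\\textit{" + n.value + "}")
        elif isinstance(n, Math):
            out.append("$" + n.value + "$")
        else:
            out.append(n.value)
    return "".join(out)
```

Both back ends read the same node list, so adding another output format means writing one more function like to_tex rather than reverse-engineering HTML. For example, to_tex(parse("the variable <math>g</math> is ''small''")) gives "the variable $g$ is \textit{small}": the math survives as math instead of turning into generic italics.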
It does bring up another point though: even in the wikitext there isn't as much semantic information as might be nice. Some is hard to come up with good markup for, but some is fairly easy--for example, encouraging people to use <math> tags for everything that's logically math, even short things like "the variable <math>g</math> is..." instead of using manual non-logical formatting commands like "italicize". Or even worse, using HTML-specific stuff like fancy divs.
-Mark