Quoting Erik Moeller <erik_moeller(a)gmx.de>:
What I don't get is why you want to parse the raw wikitext. This will be
quite a PITA unless you also bundle PHP, texvc etc. It would be much
easier to use the existing parser code to create a static HTML dump from
the wikisource, and use something like swish-E ( www.swish-e.org ) to
index that HTML dump. This would give you more time to focus on what
actually matters, i.e. the user interface.
This seems like a pretty hackish solution long-term. The HTML dump has some
semantic information, but it also has a lot of HTML-ish cruft in it. The
wikitext doesn't have all the semantic information anyone might want, but it's
much better than the HTML version. If you're going to do anything reasonably
intelligent with the output (other than just display the rendered HTML), or
output it to some different format (TeX is the one I've been working on), and
want it automated, a lot of that information will be useful.
So, basically:
Wikitext --> abstract syntax --> a presentation format (HTML, TeX, etc.)
instead of:
Wikitext --> one presentation format (HTML) --> another presentation format
...seems better to me.
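
As a rough sketch of the first pipeline (the node types and function names
here are made up for illustration, this is not the existing parser code, and
it only understands plain text, ''italic'' and <math>...</math>, with no
escaping at all):

```python
import re
from dataclasses import dataclass

# Hypothetical node types for a tiny abstract syntax.
@dataclass
class Text:
    value: str

@dataclass
class Italic:
    value: str

@dataclass
class Math:
    value: str

# Matches either ''italic'' or <math>...</math>; everything else is plain text.
TOKEN = re.compile(r"''(?P<italic>.+?)''|<math>(?P<math>.+?)</math>")

def parse(wikitext):
    """Wikitext --> abstract syntax (a flat list of nodes)."""
    nodes, pos = [], 0
    for m in TOKEN.finditer(wikitext):
        if m.start() > pos:
            nodes.append(Text(wikitext[pos:m.start()]))
        if m.group("italic") is not None:
            nodes.append(Italic(m.group("italic")))
        else:
            nodes.append(Math(m.group("math")))
        pos = m.end()
    if pos < len(wikitext):
        nodes.append(Text(wikitext[pos:]))
    return nodes

def to_html(nodes):
    """Abstract syntax --> HTML (no escaping; illustration only)."""
    out = []
    for n in nodes:
        if isinstance(n, Text):
            out.append(n.value)
        elif isinstance(n, Italic):
            out.append("<i>" + n.value + "</i>")
        else:
            out.append('<span class="math">' + n.value + "</span>")
    return "".join(out)

def to_tex(nodes):
    """Abstract syntax --> TeX (no escaping; illustration only)."""
    out = []
    for n in nodes:
        if isinstance(n, Text):
            out.append(n.value)
        elif isinstance(n, Italic):
            out.append("\\emph{" + n.value + "}")
        else:
            out.append("$" + n.value + "$")
    return "".join(out)

nodes = parse("the variable <math>g</math> is ''important''")
print(to_html(nodes))
print(to_tex(nodes))
```

Both output formats are produced from the same node list, so adding a third
format means writing one more small function rather than reverse-engineering
somebody else's HTML.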
The latter version is sort of like compiling a C++ program into x86 assembly and
then transforming it into PowerPC assembly from that, rather than doing what gcc
does--compiling C++ into an abstract intermediate representation, which can
then be output to x86 assembly or PowerPC assembly or whatever you might like.
It does bring up another point though: even in the wikitext there isn't as much
semantic information as might be nice. Some is hard to come up with good
markup for, but some is fairly easy--for example, encouraging people to use
<math> tags for everything that's logically math, even short things like "the
variable <math>g</math> is..." instead of using manual non-logical formatting
commands like "italicize". Or even worse, using HTML-specific stuff like fancy
divs.
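
To make the lost-information point concrete, here is a toy example. It
assumes, purely for illustration, an HTML renderer that emits <i>...</i> for
both ''italic'' and short <math> snippets. All three functions are
hypothetical stand-ins, not real converter code:

```python
import re

def render_html(wikitext):
    """Hypothetical renderer: both kinds of markup end up as <i>...</i>."""
    html = re.sub(r"<math>(.+?)</math>", r"<i>\1</i>", wikitext)
    html = re.sub(r"''(.+?)''", r"<i>\1</i>", html)
    return html

def html_to_tex(html):
    """Working from the HTML alone, every <i> has to be treated as emphasis."""
    return re.sub(r"<i>(.+?)</i>", r"\\emph{\1}", html)

def wikitext_to_tex(wikitext):
    """Working from the wikitext, math and emphasis are still distinguishable."""
    tex = re.sub(r"<math>(.+?)</math>", r"$\1$", wikitext)
    tex = re.sub(r"''(.+?)''", r"\\emph{\1}", tex)
    return tex

src = "the variable <math>g</math> is ''not'' constant"
print(html_to_tex(render_html(src)))  # the math variable comes out as \emph{g}
print(wikitext_to_tex(src))           # the math variable comes out as $g$
```

Once the page has gone through the HTML step, nothing distinguishes the
variable from ordinary emphasis, and the same is true of text that was
italicized by hand in the wikitext to begin with.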
-Mark