Magnus Manske wrote:
Or we could use one of these weird compiler-generating languages, or parser-generating ones, if there is such a thing. The point is, it will be separate (read: independent) from the rest of the software, which will simplify things enormously, IMHO.
Personally, I am very much in favour of using such a parser generator. Some time ago, someone on this mailing list already proposed this, but I can't find the message now. One major advantage is that we could extend the grammar and have the parser regenerated from it, so we would no longer have to tweak the parser by hand. Other advantages include efficiency, and simply the assurance that this is the "correct way", because this is how professional applications generally do it.
The process is usually broken down into four phases:
(1) lexing -- turn raw text into a series of tokens
(2) parsing -- turn the series of tokens into a parse tree
(3) processing
(4) compiling -- turn the processed parse tree into the requested output format
This is extremely general; this whole procedure can apply to programming language compilers (e.g. gcc), markup processors (e.g. browsers, LaTeX) and pretty much anything else that turns a text file in one format into some other format (not necessarily text: in the case of a compiler, it would be executable code). Because of this generality, many tools to perform these tasks already exist. In the case of step 1, this is what "lex" does. Step 2 is the field of expertise of parser generators such as "yacc" or its free-software equivalent "bison". These are C-centric in the sense that they output C code; I'm sure PHP ones exist, but maybe we want to use C for efficiency anyway. Steps 3 and 4 are application-dependent, so they are programmed manually, but given a parse tree, they are easy.
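To make the four phases concrete, here is a rough hand-written sketch in Python. It is not generated by lex or bison, and it is not a proposal for the real grammar: the token set (only [[links]] and '''bold'''), the node layout, the "/wiki/" URL prefix and the "new" CSS class are all assumptions I made up for illustration.

```python
import html
import re
from urllib.parse import quote

# Hypothetical token set for a tiny wiki subset: [[link]], '''bold''', plain text.
TOKEN_RE = re.compile(
    r"(?P<LINK_OPEN>\[\[)|(?P<LINK_CLOSE>\]\])|(?P<BOLD>''')|(?P<TEXT>[^\[\]']+|.)"
)

def lex(text):
    """Phase 1: raw text -> list of (kind, value) tokens."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def parse(tokens):
    """Phase 2: tokens -> parse tree of dict nodes."""
    root = {"type": "doc", "children": []}
    stack = [root]
    for kind, value in tokens:
        top = stack[-1]
        if kind == "LINK_OPEN":
            node = {"type": "link", "children": []}
            top["children"].append(node)
            stack.append(node)
        elif kind == "LINK_CLOSE" and top["type"] == "link":
            stack.pop()
        elif kind == "BOLD" and top["type"] == "bold":
            stack.pop()  # closing bold marker
        elif kind == "BOLD":
            node = {"type": "bold", "children": []}
            top["children"].append(node)
            stack.append(node)
        else:
            # Plain text, or a stray delimiter kept as literal text.
            top["children"].append({"type": "text", "value": value})
    return root

def process(node, existing_pages):
    """Phase 3: annotate the tree; here, flag links to non-existent pages."""
    if node["type"] == "link":
        target = "".join(
            c["value"] for c in node["children"] if c["type"] == "text"
        )
        node["target"] = target
        node["exists"] = target in existing_pages
    for child in node.get("children", []):
        process(child, existing_pages)
    return node

def compile_html(node):
    """Phase 4: processed parse tree -> output format (HTML in this sketch)."""
    if node["type"] == "text":
        return html.escape(node["value"])
    inner = "".join(compile_html(c) for c in node["children"])
    if node["type"] == "bold":
        return "<b>" + inner + "</b>"
    if node["type"] == "link":
        css = "" if node["exists"] else ' class="new"'
        return '<a href="/wiki/' + quote(node["target"]) + '"' + css + ">" + inner + "</a>"
    return inner

def render(text, existing_pages):
    """Run all four phases in order."""
    return compile_html(process(parse(lex(text)), existing_pages))
```

For example, render("'''Hi''' [[Foo]] and [[Bar]]", {"Foo"}) gives a bold "Hi", a normal link to Foo, and a link to Bar marked with class="new". With lex and bison, phases 1 and 2 would be generated from a grammar file instead of being written by hand like this.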
The "process" step is particularly application-dependent; in the case of a programming language compiler, for example, it might perform optimisations. In our case, it means:
(a) find template inclusions, recursively call this entire process with the template's wiki text and replace the template inclusion with the resulting parse tree;
(b) find links and determine if the page they point to is non-existent, a stub, etc., and "annotate" the parse tree accordingly;
(c) probably other little things I haven't thought of.
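Steps (a) and (b) above could look roughly like the following Python sketch. Everything specific in it is an assumption for illustration: the flat node list, the toy parse() standing in for the real lexer and parser, the "Template:" key prefix, the pages dictionary standing in for the database, the 20-character stub limit, and the recursion cap that stops a template from including itself forever.

```python
import re

# Toy stand-in for phases 1 and 2: split out {{templates}} and [[links]].
PIECE_RE = re.compile(r"(\{\{.*?\}\}|\[\[.*?\]\])")

STUB_LIMIT = 20  # hypothetical: pages shorter than this count as stubs

def parse(text):
    """Return a flat list of text, template and link nodes."""
    nodes = []
    for piece in PIECE_RE.split(text):
        if piece.startswith("{{") and piece.endswith("}}"):
            nodes.append({"type": "template", "name": piece[2:-2].strip()})
        elif piece.startswith("[[") and piece.endswith("]]"):
            nodes.append({"type": "link", "target": piece[2:-2].strip()})
        elif piece:
            nodes.append({"type": "text", "value": piece})
    return nodes

def link_status(target, pages):
    """Classify a link target as missing, stub or ok."""
    body = pages.get(target)
    if body is None:
        return "missing"
    if len(body) < STUB_LIMIT:
        return "stub"
    return "ok"

def process(nodes, pages, depth=0, max_depth=10):
    """(a) expand templates recursively, (b) annotate links."""
    out = []
    for node in nodes:
        if node["type"] == "template":
            source = pages.get("Template:" + node["name"])
            if source is None or depth >= max_depth:
                # Unknown template or runaway recursion: leave the node alone.
                out.append(node)
            else:
                # Run the whole process on the template's wiki text and
                # splice the resulting nodes in place of the inclusion.
                out.extend(process(parse(source), pages, depth + 1, max_depth))
        elif node["type"] == "link":
            out.append({**node, "status": link_status(node["target"], pages)})
        else:
            out.append(node)
    return out
```

Calling process(parse(wiki_text), pages) returns the node list with templates replaced by their expanded contents and every link carrying a "status" field, which step 4 can then use to pick the link style.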
I would be more than willing to help with this, especially steps 3 and 4 :-), but since I have absolutely no experience with lex or bison, I would need some help with those.
Have I mentioned yet that this is the only correct way to do this? :-)
Timwi