There seems to be a lot of disjoint discussion on Meta about this. Viz:
* There is work that has been done by Taw on an OCaml lexer at http://meta.wikipedia.org/wiki/Wikipedia_lexer
* There are some links at http://meta.wikipedia.org/wiki/Wikitext_syntax
* A proposal for a radically different Wiki text language at http://meta.wikipedia.org/wiki/Wikitax
* A brief take at http://meta.wikipedia.org/wiki/Wiki_markup_syntax
* A nearly content-free page at http://meta.wikipedia.org/wiki/Wiki_syntax
* A draft XML syntax for Wikitext at http://meta.wikipedia.org/wiki/Wikipedia_DTD
Clearly there needs to be some kind of centralized place for work on formalizing the language. I would suggest the recently-created http://meta.wikipedia.org/wiki/Wikitext_standard
Right now we should, as Ed says, describe and formalize a 1.0 version of the Wikitext language, based on what is used currently. In other words, this work should not (for now) involve incorporating improvements or changes to the Wikitext language.
Moving on...
First, a couple of issues of nomenclature that we should probably get out of the way:
(1) We need to decide on a name for the wiki markup language, or Wiki text. I would advocate calling the language "Wikitext" (and calling it "the Wikitext language" when usage might be ambiguous, like "C" or "the C language"). This seems to be common usage.
(2) A program that converts Wikitext to HTML really consists of three (at this point, entirely theoretical) parts: the lexical analyzer, the parser, and the (HTML) code generator. Of course, our language is so simple and the output language so similar to the input that these steps are basically all rolled into one. Nevertheless, calling the whole system a 'parser' is not strictly correct. I think 'translator' is more accurate, at least from a CS perspective. I will use the name "Wikitext to HTML translator" unless someone comes up with something better.
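To make the three stages concrete, here is a minimal, purely illustrative Python sketch of a translator for a tiny invented subset of Wikitext (just '''bold''' and ''italic''). The token names and rules are my own, not any official grammar; it shows how the parsing and generation steps naturally collapse into a single pass over the token stream:

```python
import re

def lex(text):
    """Lexical analysis: split the input into (token name, text) pairs."""
    # Illustrative token rules only; BOLD must be tried before ITALIC.
    token_spec = [("BOLD", r"'''"), ("ITALIC", r"''"), ("TEXT", r"[^']+|'")]
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in token_spec)
    return [(m.lastgroup, m.group()) for m in re.finditer(pattern, text)]

def translate(text):
    """Parse the token stream and generate HTML in one combined pass."""
    out = []
    open_tags = {"BOLD": False, "ITALIC": False}
    tags = {"BOLD": "b", "ITALIC": "i"}
    for kind, value in lex(text):
        if kind == "TEXT":
            out.append(value)
        else:
            # A markup token alternately opens and closes its element.
            tag = tags[kind]
            out.append(f"</{tag}>" if open_tags[kind] else f"<{tag}>")
            open_tags[kind] = not open_tags[kind]
    return "".join(out)

print(translate("''hello'' '''world'''"))  # → <i>hello</i> <b>world</b>
```

A real translator would of course need error recovery and the full language, but the shape — token rules feeding a single generation pass — is the point.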
In addition to a formalization of the language, we also need a *reference* implementation of a Wikitext to HTML translator. Right now what we have is a de facto reference translator: the functions in OutputPage.php. I think most would agree that they're not an ideal implementation, but right now, it's the only (proven) complete and working implementation of a translator.
The current translator has the following practical and theoretical flaws:
(1) It is a little buggy, and, as Neil R. pointed out, there are some rendering quirks documented at http://en.wikipedia.org/wiki/User:Marumari/Wikitext_Rendering_Quirks.
(2) It is written in PHP, which is a relatively slow scripting language.
(3) It works mainly by regular-expression search-and-replace, which can be wildly inefficient.
(4) From a theoretical standpoint, it isn't based on any formally declared reference grammar for Wikitext, which leads to (1).
The ideal translator will:
(1) be written so that it is very efficient, either in PHP or in a compiled language like C or C++;
(2) be portable and embeddable in a variety of language environments; and
(3) be an example of well-written code generally.
Other thoughts I couldn't find a good place for above:
* A translator written using Lex and Yacc would be a C translator, as C is the output language of those tools. I think using Lex and Yacc or similar tools would be a good approach because it would make alterations to the language relatively easy to implement.
* The SWIG interface compiler http://www.swig.org can be used to compile C or C++ directly into PHP modules that can be called with normal PHP function calls. If a C or C++ translator is used and its efficiency becomes a major performance concern, then using SWIG to compile the translator directly into a PHP module would probably be the most efficient way to use it. SWIG can also generate modules for Perl, Python, Tcl, Ruby, Java, and some other languages.
* Obviously, for usability purposes, we have decided not to use an XML-compatible language. That is fine. However, given the ubiquity of XML and of tools to manipulate it, I think it is desirable to have a canonical translation between Wikitext and XML. An XML translation of Wikitext would allow better interoperation between Wikitext documents and other systems. Also, the conversion from XML to HTML could be handled by standardized software and technologies, like XSLT. I recognize that current implementations of these standards are lacking in some areas, but in the long term they may be the best solution. For now, I think, there is no reason not to focus simply on making a good Wikitext to HTML translator.
* We can have a competition of sorts to pick the best implementation of a Wikitext->HTML translator and declare that the 1.0 reference translator.
* As Neil H. said, there should be a way for translators "to be validated as correct, by allowing the compilation of a set of unit tests"
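On the XML point above, a hedged sketch of what a canonical XML form of a small Wikitext fragment might look like, built with Python's standard library. The element and attribute names (article, para, link, target) are hypothetical placeholders, not taken from the Wikipedia_DTD draft:

```python
import xml.etree.ElementTree as ET

# Hypothetical canonical XML form of: "See the [[Wikitext standard]] page."
article = ET.Element("article")
para = ET.SubElement(article, "para")
para.text = "See the "
link = ET.SubElement(para, "link")
link.set("target", "Wikitext standard")  # illustrative attribute name
link.text = "Wikitext standard"
link.tail = " page."

xml_out = ET.tostring(article, encoding="unicode")
print(xml_out)
```

Once a document is in such a form, an XSLT stylesheet (or any XML tool) can take over the XML-to-HTML step.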
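The validation idea in the last point could be as simple as a shared table of (Wikitext, expected HTML) pairs that every candidate translator must reproduce. A hedged Python sketch, where both the test cases and the stand-in `translate` function are placeholders of my own:

```python
import re

def translate(text):
    # Stand-in translator covering only ''italic''; a real candidate
    # implementation would be dropped in here.
    return re.sub(r"''(.+?)''", r"<i>\1</i>", text)

# A shared table of cases that defines conformance.
TEST_CASES = [
    ("plain text", "plain text"),
    ("''emphasis''", "<i>emphasis</i>"),
]

def validate(translator):
    """Return True iff the translator reproduces every expected output."""
    return all(translator(src) == expected for src, expected in TEST_CASES)

print(validate(translate))  # a conforming translator prints True
```

The same table could be published alongside the reference grammar, so "correct" means nothing more or less than passing it.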
I will put most of this content on meta, but I thought I should post it to the mailing list to stir up interest in a way that can be put to good use.
- David [[User: Nohat]]