There seems to be a lot of disjoint discussion on Meta about this. Viz:
* There is work that has been done by Taw on an OCaml lexer at http://meta.wikipedia.org/wiki/Wikipedia_lexer
* There are some links at http://meta.wikipedia.org/wiki/Wikitext_syntax
* A proposal for a radically different Wiki text language at http://meta.wikipedia.org/wiki/Wikitax
* A brief take at http://meta.wikipedia.org/wiki/Wiki_markup_syntax
* A nearly content-free page at http://meta.wikipedia.org/wiki/Wiki_syntax
* A draft XML syntax for Wikitext at http://meta.wikipedia.org/wiki/Wikipedia_DTD
Clearly there needs to be some kind of centralized place for work on formalizing the language. I would suggest the recently-created http://meta.wikipedia.org/wiki/Wikitext_standard
Right now we should, as Ed says, describe and formalize a 1.0 version of the Wikitext language, based on what is used currently. In other words, this work should not (for now) involve incorporating improvements or changes to the Wikitext language.
Moving on...
First, a couple of issues of nomenclature that we should probably get out of the way:
(1) We need to decide on a name for the wiki markup language, or Wiki text. I would advocate calling the language "Wikitext" (and calling it "the Wikitext language" when usage might be ambiguous, like "C" or "the C language"). This seems to be common usage.
(2) A program that converts Wikitext to HTML really consists of three (at this point, entirely theoretical) parts: the lexical analyzer, the parser, and the (HTML) code generator. Of course, our language is so simple and the output language so similar to the input that these steps are basically all rolled into one. Nevertheless, calling the whole system a 'parser' is not strictly correct. I think 'translator' is more accurate, at least from a CS perspective. I will use the name "Wikitext to HTML translator" unless someone comes up with something better.
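To make the three stages concrete, here is a minimal, purely illustrative Python sketch of a translator for a tiny invented subset of Wikitext (just '''bold''' and ''italic''). The token names and rules are my own, not any official grammar; it shows how the parsing and generation steps naturally collapse into a single pass over the token stream:

```python
import re

def lex(text):
    """Lexical analysis: split the input into (token name, text) pairs."""
    # Illustrative token rules only; BOLD must be tried before ITALIC.
    token_spec = [("BOLD", r"'''"), ("ITALIC", r"''"), ("TEXT", r"[^']+|'")]
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in token_spec)
    return [(m.lastgroup, m.group()) for m in re.finditer(pattern, text)]

def translate(text):
    """Parse the token stream and generate HTML in one combined pass."""
    out = []
    open_tags = {"BOLD": False, "ITALIC": False}
    tags = {"BOLD": "b", "ITALIC": "i"}
    for kind, value in lex(text):
        if kind == "TEXT":
            out.append(value)
        else:
            # A markup token alternately opens and closes its element.
            tag = tags[kind]
            out.append(f"</{tag}>" if open_tags[kind] else f"<{tag}>")
            open_tags[kind] = not open_tags[kind]
    return "".join(out)

print(translate("''hello'' '''world'''"))  # → <i>hello</i> <b>world</b>
```

A real translator would of course need error recovery and the full language, but the shape — token rules feeding a single generation pass — is the point.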
In addition to a formalization of the language, we also need a *reference* implementation of a Wikitext to HTML translator. Right now what we have is a de facto reference translator: the functions in OutputPage.php. I think most would agree that they're not an ideal implementation, but right now, it's the only (proven) complete and working implementation of a translator.
The current translator has the following practical and theoretical flaws:
(1) It is a little buggy, and, as Neil R. pointed out, there are some rendering quirks documented at http://en.wikipedia.org/wiki/User:Marumari/Wikitext_Rendering_Quirks.
(2) It is written in PHP, which is a relatively slow scripting language.
(3) It works mainly by regular-expression search-and-replace, which can be wildly inefficient.
(4) From a theoretical standpoint, it isn't based on any formally declared reference grammar for Wikitext, which leads to (1).
The ideal translator will:
(1) be written so that it is very efficient, either in PHP or in a compiled language like C or C++;
(2) be portable and embeddable in a variety of language environments; and
(3) be an example of well-written code generally.
Other thoughts I couldn't find a good place for above:
* A translator written using Lex and Yacc would be a C translator, as C is the output language of those tools. I think using Lex and Yacc or similar tools would be a good approach because it would make alterations to the language relatively easy to implement.
* The SWIG interface compiler http://www.swig.org can be used to compile C or C++ directly into PHP modules that can be called with normal PHP function calls. If a C or C++ translator is used and its efficiency becomes a major performance concern, then using SWIG to compile the translator directly into a PHP module would probably be the most efficient way to use it. SWIG can also generate modules for Perl, Python, Tcl, Ruby, Java, and some other languages.
* Obviously, for usability purposes, we have decided not to use an XML-compatible language. That is fine. However, given the ubiquity of XML and of tools to manipulate it, I think it is desirable to have a canonical translation between Wikitext and XML. An XML translation of Wikitext would allow better interoperation between Wikitext documents and other systems. Also, the conversion from XML to HTML could be handled by standardized software and technologies, like XSLT. I recognize that current implementations of these standards are lacking in some areas, but in the long term they may be the best solution. For now, I think, there is no reason not to focus simply on making a good Wikitext to HTML translator.
* We can have a competition of sorts to pick the best implementation of a Wikitext->HTML translator and declare that the 1.0 reference translator.
* As Neil H. said, there should be a way for translators "to be validated as correct, by allowing the compilation of a set of unit tests"
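On the XML point above, a hedged sketch of what a canonical XML form of a small Wikitext fragment might look like, built with Python's standard library. The element and attribute names (article, para, link, target) are hypothetical placeholders, not taken from the Wikipedia_DTD draft:

```python
import xml.etree.ElementTree as ET

# Hypothetical canonical XML form of: "See the [[Wikitext standard]] page."
article = ET.Element("article")
para = ET.SubElement(article, "para")
para.text = "See the "
link = ET.SubElement(para, "link")
link.set("target", "Wikitext standard")  # illustrative attribute name
link.text = "Wikitext standard"
link.tail = " page."

xml_out = ET.tostring(article, encoding="unicode")
print(xml_out)
```

Once a document is in such a form, an XSLT stylesheet (or any XML tool) can take over the XML-to-HTML step.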
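The validation idea in the last point could be as simple as a shared table of (Wikitext, expected HTML) pairs that every candidate translator must reproduce. A hedged Python sketch, where both the test cases and the stand-in `translate` function are placeholders of my own:

```python
import re

def translate(text):
    # Stand-in translator covering only ''italic''; a real candidate
    # implementation would be dropped in here.
    return re.sub(r"''(.+?)''", r"<i>\1</i>", text)

# A shared table of cases that defines conformance.
TEST_CASES = [
    ("plain text", "plain text"),
    ("''emphasis''", "<i>emphasis</i>"),
]

def validate(translator):
    """Return True iff the translator reproduces every expected output."""
    return all(translator(src) == expected for src, expected in TEST_CASES)

print(validate(translate))  # a conforming translator prints True
```

The same table could be published alongside the reference grammar, so "correct" means nothing more or less than passing it.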
I will put most of this content on meta, but I thought I should post it to the mailing list to stir up interest in a way that can be put to good use.
- David [[User: Nohat]]