Gregory Szorc wrote:
> On 9/29/06, Magnus Manske <magnusmanske(a)googlemail.com> wrote:
>> I've been among those writing parsers (many half-baked ones ;-) and
>> IMHO the only viable option, in the long run, is to make an abstract
>> grammar (whatever style) and have parser generators for many
>> languages implement it.
> I agree that down the road a formal grammar should be adopted, but
> first things first: separate the parser. I would love to see the
> MediaWiki parser become something like Radeox
> (http://radeox.org/space/start). That rendering engine is used by
> Confluence, XWiki, and others. It is currently only written in Java,
> but that is fine. The MediaWiki parser would initially be available
> only in PHP, which is still much better than it being available only
> within MediaWiki.
>
> Also, the parser could still be maintained by the MediaWiki team.
> They would not have to give up control of the parser or their vision
> for it. The only change is that the parser could stand on its own,
> and its power and popular syntax could be used by scores of other
> (PHP) wikis.
>
> On another positive note, decoupling the parser would also be a
> great opportunity to fix any quirks in the current parser, including
> rendering issues.
A parser that implements a subset of the native MediaWiki parser's
behaviour is entirely possible, and has been done several times before,
but a complete decoupling is rather more challenging. I imagine it would
be rather like the separation between the Zend Engine and PHP. Features
such as the following rely on diverse parts of the MediaWiki framework
and would have to be handled by hooks or callbacks:
* link colouring
* interlanguage link recognition
* URL generation
* template text fetch
* image rendering
* double-underscore properties such as __NEWSECTIONLINK__
* core parser functions
* variables, e.g. {{NUMBEROFARTICLES}}
* language conversion
* extensions
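To make the dependency concrete, here is a minimal sketch (in Python for
brevity, since no decoupled parser actually exists) of what a
callback-based interface for a few of those features might look like.
All class and method names here are invented for illustration; they are
not real MediaWiki APIs.

```python
# Hypothetical sketch: a standalone parser that delegates
# environment-dependent features (link colouring, template text fetch,
# variables like NUMBEROFARTICLES) to caller-supplied callbacks.
import re

class WikiEnvironment:
    """Callbacks the host wiki would have to supply to the parser."""
    def page_exists(self, title):    # drives red/blue link colouring
        return False
    def fetch_template(self, name):  # returns raw wikitext or None
        return None
    def get_variable(self, name):    # e.g. NUMBEROFARTICLES
        return None

class StandaloneParser:
    def __init__(self, env):
        self.env = env

    def parse(self, text):
        # Expand {{...}}: try a variable first, then a template fetch;
        # leave the source text untouched if neither resolves.
        def expand(match):
            name = match.group(1)
            value = self.env.get_variable(name)
            if value is not None:
                return str(value)
            template = self.env.fetch_template(name)
            return template if template is not None else match.group(0)
        text = re.sub(r"\{\{([^{}|]+)\}\}", expand, text)

        # Render [[...]] links, colouring by page existence.
        def link(match):
            title = match.group(1)
            cls = "existing" if self.env.page_exists(title) else "new"
            return '<a class="%s">%s</a>' % (cls, title)
        return re.sub(r"\[\[([^\[\]|]+)\]\]", link, text)

class DemoEnv(WikiEnvironment):
    def page_exists(self, title):
        return title == "Main Page"
    def get_variable(self, name):
        return 42 if name == "NUMBEROFARTICLES" else None

parser = StandaloneParser(DemoEnv())
html = parser.parse(
    "{{NUMBEROFARTICLES}} articles. See [[Main Page]] and [[Missing]].")
```

Even this toy version shows the shape of the problem: every feature in
the list above becomes another callback the host application must
implement before the "decoupled" parser produces useful output.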
Some of these items could be deferred by having the parser output an
intermediate representation, which can then be converted to HTML by a
feature-rich output phase. But that doesn't exempt you from writing that
output phase, if you want the parser to be useful for anything at all.
Few people realise what a large proportion of the MediaWiki codebase is
accessed by the present parser module.
I'm in favour of a C/C++ module closely coupled with the existing PHP
framework, to speed up wikitext to HTML transformation. I can also see that
feature-reduced parsers may be occasionally useful, such as an embeddable
PHP parser along the lines of Gregory's original post. But for
fully-featured wikitext to HTML conversion, including access to
MediaWiki-specific features like those listed above, the parser has to be
coupled with MediaWiki itself.
It may be possible to decouple the parser, as Gregory suggests, and to
add MediaWiki-specific features back in with callbacks or
post-processing. However, it would be a lot of work, and any performance
losses due to the abstraction would have to be offset by gains elsewhere
if Wikimedia is going to buy in. You might be better off just using an
independent feature-reduced parser like PEAR's Text_Wiki_Mediawiki.
-- Tim Starling