Gregory Szorc wrote:
> On 9/29/06, Magnus Manske <magnusmanske(a)googlemail.com> wrote:
>> I've been among those writing parsers (many half-baked ones ;-) and
>> IMHO the only viable option, in the long run, is to make an abstract
>> grammar (whatever style) and have parser generators for many
>> languages implement it.
> I agree that down the road a formal grammar should be adopted, but
> first things first: separate the parser. I would love to see the
> MediaWiki parser become something like Radeox
> (http://radeox.org/space/start). That rendering engine is used by
> Confluence, XWiki, and others. It is currently only written in Java,
> but that is fine. The MediaWiki parser would initially be available
> only in PHP, which is still much better than it being available only
> within MediaWiki.
>
> Also, the parser could still be maintained by the MediaWiki team.
> They would not have to give up control of the parser or their vision
> for it. The only change is that the parser could stand on its own,
> and its power and popular syntax could be used by scores of other
> (PHP) wikis.
>
> On another positive note, decoupling the parser would also be a
> great opportunity to fix any quirks in the current parser, including
> rendering issues.
A parser that implements a subset of the native MediaWiki parser's
behaviour is entirely possible, and has been done several times before,
but a complete decoupling is rather more challenging. I imagine it would
be rather like the separation between the Zend Engine and PHP. Features
such as the following rely on diverse parts of the MediaWiki framework
and would have to be handled by hooks or callbacks:
* link colouring
* interlanguage link recognition
* URL generation
* template text fetch
* image rendering
* double-underscore properties such as __NEWSECTIONLINK__
* core parser functions
* variables, e.g. {{NUMBEROFARTICLES}}
* language conversion
* extensions
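To make the dependency concrete, here is a minimal sketch (in Python for
brevity, since no decoupled parser actually exists) of what a
callback-based interface for a few of those features might look like.
All class and method names here are invented for illustration; they are
not real MediaWiki APIs.

```python
# Hypothetical sketch: a standalone parser that delegates
# environment-dependent features (link colouring, template text fetch,
# variables like NUMBEROFARTICLES) to caller-supplied callbacks.
import re

class WikiEnvironment:
    """Callbacks the host wiki would have to supply to the parser."""
    def page_exists(self, title):    # drives red/blue link colouring
        return False
    def fetch_template(self, name):  # returns raw wikitext or None
        return None
    def get_variable(self, name):    # e.g. NUMBEROFARTICLES
        return None

class StandaloneParser:
    def __init__(self, env):
        self.env = env

    def parse(self, text):
        # Expand {{...}}: try a variable first, then a template fetch;
        # leave the source text untouched if neither resolves.
        def expand(match):
            name = match.group(1)
            value = self.env.get_variable(name)
            if value is not None:
                return str(value)
            template = self.env.fetch_template(name)
            return template if template is not None else match.group(0)
        text = re.sub(r"\{\{([^{}|]+)\}\}", expand, text)

        # Render [[...]] links, colouring by page existence.
        def link(match):
            title = match.group(1)
            cls = "existing" if self.env.page_exists(title) else "new"
            return '<a class="%s">%s</a>' % (cls, title)
        return re.sub(r"\[\[([^\[\]|]+)\]\]", link, text)

class DemoEnv(WikiEnvironment):
    def page_exists(self, title):
        return title == "Main Page"
    def get_variable(self, name):
        return 42 if name == "NUMBEROFARTICLES" else None

parser = StandaloneParser(DemoEnv())
html = parser.parse(
    "{{NUMBEROFARTICLES}} articles. See [[Main Page]] and [[Missing]].")
```

Even this toy version shows the shape of the problem: every feature in
the list above becomes another callback the host application must
implement before the "decoupled" parser produces useful output.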
Some of these items could be deferred by having the parser output an
intermediate representation, which can then be converted to HTML by a
feature-rich output phase. But that doesn't exempt you from writing that
output phase, if you want the parser to be useful for anything at all.
Few people realise what a large proportion of the MediaWiki codebase is
accessed by the present parser module.
I'm in favour of a C/C++ module closely coupled with the existing PHP
framework, to speed up wikitext to HTML transformation. I can also see that
feature-reduced parsers may be occasionally useful, such as an embeddable
PHP parser along the lines of Gregory's original post. But for
fully-featured wikitext to HTML conversion, including access to
MediaWiki-specific features like those listed above, the parser has to be
coupled with MediaWiki itself.
It may be possible to decouple the parser, as Gregory suggests, and to
add MediaWiki-specific features back in with callbacks or
post-processing. However, it would be a lot of work, and any performance
losses due to the abstraction would have to be offset by gains elsewhere
if Wikimedia is going to buy in. You might be better off just using an
independent feature-reduced parser like PEAR's Text_Wiki_Mediawiki.
-- Tim Starling