On 28/03/10 18:59, Aryeh Gregor wrote:
On Fri, Mar 26, 2010 at 10:48 PM, Damon Wang <damonwang@uchicago.edu> wrote:
(You also suggested doing this as a MediaWiki extension rather than a core feature; I'm going to do that, but I won't say anything more because it seems fairly uncontroversial.)
I actually disagree with this pretty strongly. It would be a regression in functionality for existing users -- if they upgrade, their wiki breaks unless they install a new extension. There's no reason to remove it from core that I see that outweighs this disadvantage.
Since the subset of TeX you need parsed has a context-free grammar, it needs an LALR parser, not just a bunch of regexes. I know three ways to get an LALR parser:
(1) write a pushdown automaton manually (i.e., be yacc)
(2) write input for a parser-generator
(3) write a parser-generator, and give it input
Option (2) is the most maintainable and feasible option, and it's precisely the one that cannot be done in PHP. As far as I know, PHP has no parser-generator package. (Please, please let me know if that's incorrect so I can stop embarrassing myself and get on with writing a GSoC proposal.)
I could probably do (1), or some hackish kludge at half of it, by throwing custom control structures into a bucketload of regexes, but I don't think that's in the project's best interests. As has been pointed out, the OCaml implementation is really concise and elegant. A large fraction of that concision and elegance comes from not actually being a parser but rather only a context-free grammar written in a BNF-like syntax common to most parser-generators.
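For a sense of what the hand-written route in (1) involves, here is a minimal sketch in Python for a tiny, made-up TeX subset (single letters and digits, brace groups, `^`, `_`, and `\frac`). Note that it uses recursive descent rather than a literal LALR pushdown automaton, which is a simpler, swapped-in technique. The grammar and the names `tokenize`, `Parser`, and `parse` are assumptions for illustration, not anything taken from texvc.

```python
import re

# Hypothetical toy subset: single-character symbols, {groups},
# ^ and _ scripts, and \frac. Other commands are tokenized but rejected.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|[A-Za-z0-9]|[{}^_]")

def tokenize(src):
    """Split the source into tokens, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at {pos}")
        tokens.append(m.group())
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def parse_sequence(self, stop=None):
        # sequence := scripted*
        items = []
        while self.peek() is not None and self.peek() != stop:
            items.append(self.parse_scripted())
        return items

    def parse_scripted(self):
        # scripted := atom (('^' | '_') atom)*
        node = self.parse_atom()
        while self.peek() in ("^", "_"):
            op = self.take()
            node = ("sup" if op == "^" else "sub", node, self.parse_atom())
        return node

    def parse_atom(self):
        # atom := symbol | '{' sequence '}' | '\frac' atom atom
        tok = self.peek()
        if tok == "{":
            self.take("{")
            body = self.parse_sequence(stop="}")
            self.take("}")
            return ("group", body)
        if tok == "\\frac":
            self.take()
            return ("frac", self.parse_atom(), self.parse_atom())
        if tok is not None and tok not in ("}", "^", "_") and not tok.startswith("\\"):
            return ("sym", self.take())
        raise SyntaxError(f"unexpected token {tok!r}")

def parse(src):
    """Parse a string in the toy subset into a nested tuple tree."""
    return Parser(tokenize(src)).parse_sequence()
```

Even for this toy subset the control flow and the grammar are tangled together, which is the maintainability cost being weighed against a declarative grammar file.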
Okay, well, maybe you're right. I'd be interested to hear Tim Starling's opinion on this (using parser generators vs. writing by hand). Writing it in Python would certainly be a big step forward from OCaml -- any site with LaTeX accessible to MediaWiki will almost certainly have Python available, so Python vs. PHP should make no difference to end-users. And Python is probably the second-best-known language among MediaWiki hackers.
Have you had a look at pyparsing, which is a ready-made all-singing-all-dancing Python parser package with a large amount of syntactic sugar built in to allow the more-or-less direct input of grammar notations?
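To illustrate what "more-or-less direct input of grammar notations" can look like, here is a rough sketch of a pyparsing grammar for a tiny, made-up TeX subset (single letters and digits, brace groups, `^`, `_`, and `\frac`). The grammar is an assumption for illustration and is not the texvc grammar. It needs the third-party pyparsing package installed, and it uses the long-standing camelCase method names.

```python
# Requires the third-party pyparsing package (pip install pyparsing).
from pyparsing import Forward, Group, Literal, Regex, Suppress, ZeroOrMore

# Hypothetical toy grammar, not the texvc one:
#   expr     := scripted*
#   scripted := atom (('^' | '_') atom)*
#   atom     := '\frac' atom atom | '{' expr '}' | symbol
expr = Forward()
atom = Forward()

symbol = Regex(r"[A-Za-z0-9]")                       # one letter or digit
group = Group(Suppress("{") + expr + Suppress("}"))  # braces dropped from output
frac = Group(Literal("\\frac") + atom + atom)

atom <<= frac | group | symbol
scripted = Group(atom + ZeroOrMore((Literal("^") | Literal("_")) + atom))
expr <<= ZeroOrMore(scripted)

def parse(src):
    """Parse the whole string; raises pyparsing.ParseException on bad input."""
    return expr.parseString(src, parseAll=True).asList()
```

Each Python assignment corresponds to one line of the BNF in the comment, so the grammar stays readable without a separate generator step. Note that pyparsing builds a recursive-descent parser at run time rather than LALR tables.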
Given that the texvc source already has a grammar encoded into it in machine-executable form, it might be an idea to consider mechanically extracting that grammar from the texvc OCaml source, and then reformatting it into a grammar in pyparsing's natural format.
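As a rough sketch of what the extraction step might look like, the Python below pulls production rules out of an ocamlyacc-style rules section by stripping comments and the brace-delimited OCaml actions. The sample fragment is made up for illustration and is not copied from texvc; the function names are hypothetical. It assumes every rule ends with `;` and that no action contains an unbalanced brace inside a string or character literal.

```python
import re

# Made-up ocamlyacc-style fragment for illustration; NOT copied from texvc.
SAMPLE_MLY = """
expr:
    /* empty */                   { [] }
  | atom expr                     { $1 :: $2 }
;
atom:
    LITERAL                       { Lit $1 }
  | CURLY_OPEN expr CURLY_CLOSE   { Group $2 }
;
"""

def strip_actions(text):
    """Drop /* comments */ and { ... } semantic actions, tracking brace nesting."""
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.S)
    out, depth = [], 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)

def extract_rules(mly_rules):
    """Return {nonterminal: [[symbol, ...], ...]}; an empty list is an empty alternative."""
    rules = {}
    for chunk in strip_actions(mly_rules).split(";"):
        if ":" not in chunk:
            continue
        head, body = chunk.split(":", 1)
        rules[head.strip()] = [alt.split() for alt in body.split("|")]
    return rules

if __name__ == "__main__":
    for name, alternatives in extract_rules(SAMPLE_MLY).items():
        print(name, "::=", " | ".join(" ".join(a) or "<empty>" for a in alternatives))
```

The resulting dictionary is only the bare grammar. The semantic actions, which build the output, would still have to be ported by hand.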
-- Neil