On Fri, Mar 26, 2010 at 10:48 PM, Damon Wang <damonwang@uchicago.edu> wrote:
(You also suggested doing this as a MediaWiki extension rather than a core feature; I'm going to do that, but I won't say anything more because it seems fairly uncontroversial.)
I actually disagree with this pretty strongly. It would be a regression in functionality for existing users -- if they upgrade, their wiki breaks unless they install a new extension. I don't see any reason to remove it from core that outweighs this disadvantage.
Since the subset of TeX you need parsed has a context-free grammar, it needs an LALR parser, not just a bunch of regexes. I know three ways to get an LALR parser:
(1) write a pushdown automaton manually (i.e., be yacc)
(2) write input for a parser-generator
(3) write a parser-generator, and give it input
Option (2) is the most maintainable and feasible option, and it's precisely the one that cannot be done in PHP. As far as I know, PHP has no parser-generator package. (Please, please let me know if that's incorrect so I can stop embarrassing myself and get on with writing a GSoC proposal.)
I could probably do (1), or some hackish kludge at half of it, by throwing custom control structures into a bucketload of regexes, but I don't think that's in the project's best interests. As has been pointed out, the OCaml implementation is really concise and elegant. A large fraction of that concision and elegance comes from not actually being a parser but rather only a context-free grammar written in a BNF-like syntax common to most parser-generators.
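To make the regex-versus-parser point concrete, here is a minimal Python sketch of option (1) for a toy subset of TeX: a regex splits the input into tokens, and a small hand-written recursive function tracks brace nesting. It is a recursive-descent sketch, not an LALR parser and not the existing OCaml code; the token pattern and the names `tokenize` and `parse` are made up for illustration.

```python
import re

# Hypothetical token pattern for a toy TeX subset: a command such as \frac,
# a single brace, or any other single non-space character.
# Assumption: a backslash not followed by letters is silently dropped.
TOKEN = re.compile(r"\\[A-Za-z]+|[{}]|[^\s\\{}]")


def tokenize(src):
    """Split a TeX string into a flat list of tokens (regexes suffice here)."""
    return TOKEN.findall(src)


def parse(tokens):
    """Turn a flat token list into a nested list; each {...} becomes a sublist."""
    pos = 0

    def group(depth):
        nonlocal pos
        items = []
        while pos < len(tokens):
            tok = tokens[pos]
            pos += 1
            if tok == "{":
                # The recursion is the "pushdown" part: the call stack
                # remembers how many groups are still open.
                items.append(group(depth + 1))
            elif tok == "}":
                if depth == 0:
                    raise ValueError("unmatched closing brace")
                return items
            else:
                items.append(tok)
        if depth > 0:
            raise ValueError("unclosed brace")
        return items

    return group(0)
```

For example, `parse(tokenize(r"\frac{a}{\sqrt{b}}"))` gives a nested list in which `\sqrt{b}` stays inside the second argument of `\frac`. A flat regex match cannot express that for arbitrary nesting depth, which is why some kind of stack is needed once braces can nest.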
Okay, well, maybe you're right. I'd be interested to hear Tim Starling's opinion on this (using parser generators vs. writing by hand). Writing it in Python would certainly be a big step forward from OCaml -- any site with LaTeX accessible to MediaWiki will almost certainly have Python available, so Python vs. PHP should make no difference to end-users. And Python is probably the second-best-known language among MediaWiki hackers.
I think it'd be easier to find a programmer who has worked with a parser-generator and can learn a little bit of OCaml, than it would be to find a PHP programmer who has to read himself into a manually implemented parser. After all, how many PHP programmers do you know who have experience mucking around inside an LALR parser?
The parsing part is unlikely to need much maintenance. There are other things currently in OCaml that make more sense to modify from time to time -- like the whitelist of commands, and (some of?) the code for non-image output formats. So for instance, MathML output is theoretically supported, but I don't know how good the support is. That might become more important in the future, since Firefox is likely to support inline MathML in text/html not too long from now. This sort of thing would be harder if it were Python rather than PHP.
I don't think it would be a big deal if it were rewritten entirely in Python, though. It would be a big step forward in any case, and if it's easier for you, great. So personally I'd be okay with it, although it's perhaps not ideal.
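As a rough idea of what maintaining the command whitelist mentioned above could look like in Python, here is a minimal sketch. The whitelist contents are purely illustrative and are not the actual list used by the current OCaml code; `ALLOWED_COMMANDS` and `disallowed_commands` are hypothetical names.

```python
import re

# Illustrative whitelist only (assumption): the real list of permitted
# commands lives in the existing OCaml code and is much longer.
ALLOWED_COMMANDS = {"frac", "sqrt", "alpha", "beta", "sum", "int"}

# Matches a backslash followed by letters and captures the command name.
COMMAND = re.compile(r"\\([A-Za-z]+)")


def disallowed_commands(tex):
    """Return the sorted command names in `tex` that are not whitelisted."""
    return sorted(set(COMMAND.findall(tex)) - ALLOWED_COMMANDS)
```

With a structure like this, adding or removing a permitted command is a one-line change to the set, whichever language it ends up in.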
Also, would anyone be interested in mentoring this project?
I probably wouldn't be of any help for this particular project, since I don't know anything about parsers, and my Python and TeX are passable but not great. We could probably come up with a mentor, though.