There are a few Python-based things that might be interesting, but I think you'll get a lot more love for doing something in PHP or C. Since this is a student internship, you shouldn't be bashful about using this as a learning opportunity.
I'd only caution against convincing yourself (and us) that you'll be more interested in learning something like PHP than you truly are. It might help you land a spot, but it will work against you in having a successful project, and this has such high visibility that you'll really want to be successful.
What visibility does this have? I thought it was some abandoned corner of the wiki that nobody has touched in the seven years since it was first written. What happens if I make a hash of this?
So, if you find yourself thinking about doing this in PHP and having your inner voice say "meh", then I'd recommend sticking to your guns and proposing to do this or something else in Python and/or C.
Well, now my inner voice says, "I really don't want to make a hash of this texvc port!", so let me explain why I want to do it in Python rather than PHP. I agree that the performance will probably be just fine, and that it would be a great coup for maintainability and installation and usage. The problem is, I don't think PHP has a parser-generator package.
So let me make sure I understand the problem here. You already have a texvc implementation that has worked just fine for the last seven years. TeX is pretty stable at this point, so chances are good you'd make it another seven years without problems. But you're still dissatisfied because OCaml is a hard language to find programmers for, and the existing implementation isn't really maintained. You want it ported to a different language that has more programmers available.
(You also want it as a MediaWiki extension rather than a core feature; I'm going to do that, but I won't say anything more because it seems fairly uncontroversial.)
Since the subset of TeX you need parsed has a context-free grammar (arbitrarily nested braces alone rule out a regular language), it needs a real parser such as an LALR parser, not just a bunch of regexes. I know three ways to get an LALR parser:
(1) write a pushdown automaton manually (i.e., be yacc)
(2) write input for a parser-generator
(3) write a parser-generator, and give it input
Option (2) is the most maintainable and feasible option, and it's precisely the one that cannot be done in PHP. As far as I know, PHP has no parser-generator package. (Please, please let me know if that's incorrect so I can stop embarrassing myself and get on with writing a GSoC proposal.)
I could probably do (1), or some hackish kludge at half of it, by throwing custom control structures into a bucketload of regexes, but I don't think that's in the project's best interests. As has been pointed out, the OCaml implementation is really concise and elegant. A large fraction of that concision and elegance comes from not actually being a parser but rather only a context-free grammar written in a BNF-like syntax common to most parser-generators.
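To make that concrete, here is a minimal sketch in Python of what the hand-written route looks like, even for a toy fragment. The three-rule grammar in the comment is something I made up for illustration, not texvc's actual grammar, and the function names are hypothetical. It is also a plain recursive-descent parser rather than a real LALR automaton, which would be longer still.

```python
# Toy grammar (hypothetical, NOT the real texvc grammar):
#   expr  ::= item*
#   item  ::= CHAR | group | frac
#   group ::= "{" expr "}"
#   frac  ::= "\frac" group group
import re

# Commands like \frac, single braces, or any other non-space character.
TOKEN = re.compile(r"\\[A-Za-z]+|[{}]|[^\\{}\s]")


def tokenize(src):
    """Split the input into commands, braces, and single characters."""
    return TOKEN.findall(src)


def parse_expr(tokens, pos):
    """expr ::= item* ; stops at '}' or at the end of the input."""
    items = []
    while pos < len(tokens) and tokens[pos] != "}":
        node, pos = parse_item(tokens, pos)
        items.append(node)
    return items, pos


def parse_group(tokens, pos):
    """group ::= '{' expr '}' ; the recursion is what regexes cannot do."""
    if pos >= len(tokens) or tokens[pos] != "{":
        raise SyntaxError("expected '{' at token %d" % pos)
    body, pos = parse_expr(tokens, pos + 1)
    if pos >= len(tokens) or tokens[pos] != "}":
        raise SyntaxError("unclosed '{'")
    return ("group", body), pos + 1


def parse_item(tokens, pos):
    """item ::= CHAR | group | frac"""
    tok = tokens[pos]
    if tok == "{":
        return parse_group(tokens, pos)
    if tok == "\\frac":
        num, pos = parse_group(tokens, pos + 1)
        den, pos = parse_group(tokens, pos)
        return ("frac", num, den), pos
    if tok.startswith("\\"):
        raise SyntaxError("unknown command " + tok)
    return ("char", tok), pos + 1


def parse(src):
    """Parse a whole string and return a list of nested tuples."""
    tokens = tokenize(src)
    tree, pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("unexpected '}'")
    return tree


if __name__ == "__main__":
    print(parse(r"\frac{a}{\frac{b}{c}}"))
```

Even with only three rules, the grammar ends up smeared across four function bodies, and every new construct means touching control flow. With a parser-generator, the comment block at the top is roughly all a maintainer would have to read and edit.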
I think it'd be easier to find a programmer who has worked with a parser-generator and can learn a little bit of OCaml than it would be to find a PHP programmer willing to work their way into a manually implemented parser. After all, how many PHP programmers do you know who have experience mucking around inside an LALR parser?
So that's why, while I'm happy to take it on in PHP as a learning experience for myself, I think it'd be better for MediaWiki to port texvc to Python. That gets us the larger pool of potential maintainers that comes with using a commonly known language, without sacrificing the amazing advantage of only needing to maintain a grammar rather than the parser itself.
And as far as dependencies are concerned, Python is still a much easier dependency to satisfy than OCaml, both for programmers working with the code and for sysadmins installing it.
What do you guys think?
Also, would anyone be interested in mentoring this project?
Yours, Damon Wang