Hello everyone,
I'm interested in porting texvc to Python, and I was hoping this list
might help me hash out the plan. Please let me know if I should take my
questions elsewhere.
Roughly, my plan of attack would be something like this:
1. Collect test cases and write a testing script
Thanks to avar from #wikimedia, I already have the <math>...</math> bits
from enwiki and dewiki. I would also construct some simpler ones by hand
to test each of the acceptable LaTeX commands.
Would there be any possibility of logging the input seen by texvc on a
production instance of MediaWiki, so I could get some invalid input
submitted by actual users?
This could also be useful to future maintainers for regression testing.
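As a rough illustration of step 1, a testing script could be little more
than a loop over (formula, expected result) pairs. Everything named here
(run_cases, toy_validate, the CASES list) is made up for the sketch; the
real validator would come from step 2, and the real cases from the wiki
dumps.

```python
def run_cases(cases, validate):
    """Run each (formula, should_accept) pair through validate.

    Returns the formulas whose result disagreed with the expectation.
    """
    failures = []
    for formula, should_accept in cases:
        try:
            accepted = bool(validate(formula))
        except Exception:
            # A crash on user input is treated as a rejection here.
            accepted = False
        if accepted != should_accept:
            failures.append(formula)
    return failures


def toy_validate(formula):
    # Placeholder validator: only checks that braces balance.
    depth = 0
    for ch in formula:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# Hand-written cases; the second field says whether the input is valid.
CASES = [
    (r"\frac{a}{b}", True),
    (r"\frac{a}{b", False),
    (r"x^{2}}", False),
]

if __name__ == "__main__":
    for formula in run_cases(CASES, toy_validate):
        print("FAIL:", formula)
```

Keeping the validator as a parameter means the same cases could later be
run against both the old texvc and the port.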
2. Implement an AMS-TeX validator
I'll probably use PLY because it's rumored to have helpful debugging
features (designed for a first-year compilers class, apparently). ANTLR
is another popular option, but this guy
http://www.bearcave.com/software/antlr/antlr_expr.html
thinks it's complicated and hard to debug. I've never used either, so if
anyone on this list knows of a good Python parsing package I'd welcome
suggestions.
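To show the general shape of a whitelist validator, here is a minimal
sketch that uses only the standard library's re module and a hand-written
loop, not PLY or ANTLR. ALLOWED_COMMANDS is a tiny invented subset, not
texvc's actual list, and the check is much weaker than a real grammar:
it only verifies that every command is whitelisted and that braces
balance.

```python
import re

# Hypothetical whitelist; the real set of accepted commands is far longer.
ALLOWED_COMMANDS = {"frac", "sqrt", "alpha", "beta", "sum", "int"}

TOKEN_RE = re.compile(r"""
    (?P<command>\\[A-Za-z]+)   # control word, a backslash plus letters
  | (?P<escape>\\.)            # control symbol, a backslash plus one char
  | (?P<open>\{)
  | (?P<close>\})
  | (?P<other>[^\\{}])         # any other single character
""", re.VERBOSE | re.DOTALL)


def validate(formula):
    """Return True if every command is whitelisted and braces balance."""
    depth = 0
    pos = 0
    while pos < len(formula):
        match = TOKEN_RE.match(formula, pos)
        if match is None:
            return False  # e.g. a lone trailing backslash
        kind = match.lastgroup
        if kind == "command" and match.group()[1:] not in ALLOWED_COMMANDS:
            return False
        if kind == "open":
            depth += 1
        elif kind == "close":
            depth -= 1
            if depth < 0:
                return False
        pos = match.end()
    return depth == 0
```

A parser generator would replace the loop with grammar rules, which is
where argument counts and nesting rules for each command would live.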
3. Port over the existing tex->dvi->png rendering
This is probably just a few calls into the subprocess module. Yeah, I
just jinxed it.
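A minimal sketch of what those subprocess calls might look like, assuming
the latex and dvipng command-line tools are installed. Whether the
existing texvc uses dvipng or some other dvi->png route has not been
checked here, and build_commands and render are made-up names.

```python
import os
import subprocess


def build_commands(tex_path, out_dir):
    """Return the latex and dvipng command lines for one formula file."""
    base = os.path.splitext(os.path.basename(tex_path))[0]
    dvi_path = os.path.join(out_dir, base + ".dvi")
    png_path = os.path.join(out_dir, base + ".png")
    latex_cmd = ["latex", "-interaction=nonstopmode",
                 "-output-directory=" + out_dir, tex_path]
    # -T tight crops the image to the formula's bounding box.
    dvipng_cmd = ["dvipng", "-T", "tight", "-o", png_path, dvi_path]
    return latex_cmd, dvipng_cmd


def render(tex_path, out_dir, timeout=30):
    """Run both steps and return the PNG path.

    Raises subprocess.CalledProcessError if a tool exits non-zero and
    subprocess.TimeoutExpired if it runs longer than timeout seconds.
    """
    for cmd in build_commands(tex_path, out_dir):
        subprocess.run(cmd, check=True, timeout=timeout,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
    base = os.path.splitext(os.path.basename(tex_path))[0]
    return os.path.join(out_dir, base + ".png")
```

Passing the commands as lists rather than shell strings keeps user input
away from the shell, and the timeout guards against input that makes
latex hang.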
4. Add HTML rendering to texvc and test script
I don't even understand how the existing texvc decides whether HTML is
good enough. It looks like the original programmer just decreed that
certain LaTeX commands could be rendered to HTML, and texvc falls back to
PNG if it sees anything not on that list. How important is this feature?
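If the behaviour really is a fixed list with a PNG fallback, the decision
could be sketched roughly as follows. This is a guess at the logic, not a
reading of the OCaml source, and HTML_SAFE_COMMANDS is an invented
example set.

```python
import re

# Hypothetical set of commands assumed simple enough to render as HTML.
HTML_SAFE_COMMANDS = {"alpha", "beta", "gamma", "times", "le", "ge"}

# Matches a backslash followed by letters and captures the letters.
COMMAND_RE = re.compile(r"\\([A-Za-z]+)")


def choose_output(formula):
    """Return "html" if every command is on the list, otherwise "png"."""
    commands = COMMAND_RE.findall(formula)
    if all(name in HTML_SAFE_COMMANDS for name in commands):
        return "html"
    return "png"
```

For example, choose_output(r"\alpha \le \beta") gives "html", while
choose_output(r"\frac{1}{2}") gives "png". A real version would also have
to look at structure such as nested scripts, not just command names.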
5. Repackage the entire Math thing as an extension
I might do this if I have time left at the end. I'm sure the project
will change over the summer.
Python doesn't have parsing locked down the way C does with flex/bison,
but there are some good options. It's also the language I have the most
experience with, so I think I'd be able to complete the port faster in
Python than in either of the other languages. I was tempted at first to
port to PHP, to conform with the rest of MediaWiki, but there don't seem
to be any good parsing packages for PHP. (Please tell me if that's wrong.)
I'd appreciate any advice or criticism. Since my only previous
experience has been using Wikipedia and setting up a test MediaWiki
instance for my ACM chapter, I'm only just now learning my way around
the code base and it's not always evident why things were done as they
are. Does this look like a reasonable and worthwhile project?
Yours,
Damon Wang
P.S. Some of you may remember me on IRC a couple of days ago getting a
little panicky about not knowing OCaml, but I'm a bit more hopeful now
after looking around the source. I definitely have to keep the OCaml
manual open for reference, but I've written Scheme, Common Lisp, and
Haskell before, so I think I might be able to fake it. These are just
Famous Last Words waiting to happen, I know.