On 03/23/2010 08:06 AM, Damon Wang wrote:
Hello everyone,
I'm interested in porting texvc to Python, and I was hoping this list
here might help me hash out the plan. Please let me know if I should
take my questions elsewhere.
Roughly, my plan of attack would be something like this:
1. Collect test cases and write a testing script
Thanks to avar from #wikimedia, I already have the <math>...</math> bits
from enwiki and dewiki. I would also construct some simpler ones by hand
to test each of the acceptable LaTeX commands.
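One possible shape for such a testing script, as a minimal sketch: run
every collected <math> snippet through the existing validator and the
port, and report where they disagree. The names find_disagreements,
reference and candidate are hypothetical; each of the two callables is
assumed to take a TeX string and return some comparable result (for
example "accepted" / "rejected").

```python
def find_disagreements(cases, reference, candidate):
    """Return (case, expected, actual) for every input where the two differ.

    `reference` wraps the existing validator and `candidate` wraps the
    port; both are assumed to be plain callables on a TeX string.
    """
    disagreements = []
    for case in cases:
        expected = reference(case)
        actual = candidate(case)
        if expected != actual:
            disagreements.append((case, expected, actual))
    return disagreements
```

An empty result only shows agreement on the collected corpus; it says
nothing about inputs nobody has tried yet.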
It is not too challenging to create a test file that checks most
existing commands with some hacky regexes on the existing parser (I
can't find where mine has gone though); what is much harder is proving
that your script cannot ever let through invalid or potentially harmful
LaTeX (the moment anyone gets a \catcode past you, you're doomed, and
there are many other commands that are "not wanted" to say the least).
Obviously, the current implementation makes this reasonably pleasant to
verify: the syntax of the parser is exceedingly light, so any
reimplementation should strive to have as little syntactic overhead as
possible.
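To make the "nothing unlisted gets through" idea concrete, here is a
minimal default-deny sketch in Python. It is not how texvc works
internally; ALLOWED_COMMANDS is a tiny hypothetical whitelist and the
function names are made up for illustration. It also does not prove
safety on its own, since TeX has other ways to misbehave than control
sequences.

```python
import re

# Hypothetical whitelist; the real list of accepted commands is far longer.
ALLOWED_COMMANDS = {"frac", "sqrt", "alpha", "beta", "sum", "int"}

# A control sequence is a backslash followed by a run of letters, or by a
# single other character. The trailing `.?` also catches a lone backslash
# at the end of the input (reported as an empty name).
CONTROL_SEQUENCE = re.compile(r"\\([A-Za-z]+|.?)", re.DOTALL)


def disallowed_commands(tex):
    """Return every control sequence in `tex` that is not whitelisted."""
    return [
        name
        for name in CONTROL_SEQUENCE.findall(tex)
        if name not in ALLOWED_COMMANDS
    ]


def is_acceptable(tex):
    """Accept the input only if no unlisted control sequence appears."""
    return not disallowed_commands(tex)
```

With this whitelist, \frac{1}{2} is accepted, while anything containing
\catcode is rejected simply because nobody listed it.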
2. Implement an AMS-TeX validator
How different would this be from the current validator?
3. Port over the existing tex->dvi->png rendering.
This is probably just a few calls into the subprocess module. Yeah, I
just jinxed it.
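For what it's worth, the subprocess side might look roughly like the
sketch below. It assumes the latex and dvipng programs are on the PATH,
which may not match the toolchain the current texvc shells out to, and
the TEMPLATE preamble, build_commands and render_png are all
hypothetical names. The input is assumed to have passed validation
already.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical wrapper document; the real preamble may differ.
TEMPLATE = r"""\documentclass{article}
\usepackage{amsmath,amssymb}
\pagestyle{empty}
\begin{document}
$$ FORMULA $$
\end{document}
"""


def build_commands(basename):
    """Return the argv lists for the tex -> dvi -> png steps."""
    return [
        ["latex", "-interaction=nonstopmode", "-halt-on-error",
         basename + ".tex"],
        ["dvipng", "-T", "tight", "-o", basename + ".png",
         basename + ".dvi"],
    ]


def render_png(validated_tex, basename="formula"):
    """Render already-validated TeX and return the PNG bytes."""
    with tempfile.TemporaryDirectory() as workdir:
        source = TEMPLATE.replace("FORMULA", validated_tex)
        Path(workdir, basename + ".tex").write_text(source)
        for argv in build_commands(basename):
            # check=True raises CalledProcessError if a step fails;
            # the timeout guards against a hung TeX run.
            subprocess.run(argv, cwd=workdir, check=True,
                           capture_output=True, timeout=30)
        return Path(workdir, basename + ".png").read_bytes()
```

Passing argv lists rather than a shell string keeps the formula out of
any shell quoting, and the temporary directory cleans up the .aux and
.log files afterwards.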
4. Add HTML rendering to texvc and test script
I don't even understand how the existing texvc decides whether HTML is
good enough. It looks like the original programmer just decreed that
certain LaTeX commands could be rendered to HTML, and defaults to PNG if
it sees anything not on that list. How important is this feature?
I am not too fussed about the HTML output, though I can't speak for
everyone. At the moment it seems that many more of the Unicode
characters should be let through (at least at some level of HTML),
though I don't know enough about worldwide Unicode support. Some things,
like \sqrt for example, are pretty hard to render nicely in HTML, so
images are still sensible for some expressions.
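The "list of commands that can go to HTML, otherwise PNG" behaviour
described above could be sketched like this. It is only a guess at the
general shape, not a reading of the texvc source; HTML_RENDERINGS is a
tiny hypothetical table and try_html is a made-up name.

```python
import html
import re

# Hypothetical table of commands with an obvious Unicode equivalent.
HTML_RENDERINGS = {"alpha": "α", "beta": "β", "times": "×", "le": "≤"}

CONTROL_WORD = re.compile(r"\\([A-Za-z]+)")


def try_html(tex):
    """Return an HTML string, or None if the caller should fall back to PNG."""
    # Any command without a known rendering forces the image path.
    if any(name not in HTML_RENDERINGS for name in CONTROL_WORD.findall(tex)):
        return None
    # Scripts, grouping and stray backslashes are left to the image path.
    if re.search(r"[\^_{}]", tex) or "\\" in CONTROL_WORD.sub("", tex):
        return None
    escaped = html.escape(tex, quote=False)
    return CONTROL_WORD.sub(lambda m: HTML_RENDERINGS[m.group(1)], escaped)
```

Here try_html(r"\alpha \times \beta") gives "α × β", while
try_html(r"\sqrt{x}") gives None and the expression would be rendered as
an image instead.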
5. Repackage the entire Math thing as an extension
I might do this if I have time left at the end. I'm sure the project
will change over the summer.
This would be very amazing.
Python doesn't have parsing just locked right down the way C does with
flex/bison, but there are some good options. I have the most experience
with it, and I think I'd be able to complete the port faster in Python
than in either of the other languages. I was tempted at first to port to
PHP, to conform with the rest of MediaWiki, but there don't seem to be
any good parsing packages for PHP. (Please tell me if that's wrong.)
A good PHP parser library would be exceptionally useful for MediaWiki
(and many extensions). At the moment we have loads of methods that do
regex "parsing", so if you felt like writing one... :D
I'd appreciate any advice or criticism. Since my only previous
experience has been using Wikipedia and setting up a test MediaWiki
instance for my ACM chapter, I'm only just now learning my way around
the code base and it's not always evident why things were done as they
are. Does this look like a reasonable and worthwhile project?
Step 5 has been a "we really should do this" for a while; shipping
OCaml code that many users won't be able to use is very messy. I am
less convinced of the utility of a Python port. OCaml is a great
language for implementing this, and I fear a lot of your time would be
wasted trying to make the Python similarly nice. As you note, MediaWiki
is not written in Python; doing this in PHP would be a larger step in
the right direction, though without such nice frameworks it may be less
nice to do.
Instead of rewriting the <math> parser, it might be more productive to
create parsers for some of the other languages that extensions use,
hopefully with a view to adding additional extensions to Wikipedia. The
ones I can think of immediately are <chem> tags (bug 3252/5856),
<gnuplot>, <lilypond>/<ABC> (bug 189!), <graphviz> (bug 2403).
Yours
Conrad
(PS. I'm no-one official, so can be ignored safely)