Hello everyone,
I'm interested in porting texvc to Python, and I was hoping this list here might help me hash out the plan. Please let me know if I should take my questions elsewhere.
Roughly, my plan of attack would be something like this:
1. Collect test cases and write a testing script
Thanks to avar from #wikimedia, I already have the <math>...</math> bits from enwiki and dewiki. I would also construct some simpler ones by hand to test each of the acceptable LaTeX commands.
Would there be any possibility of logging the input seen by texvc on a production instance of Mediawiki, so I could get some invalid input submitted by actual users?
This could also be useful to future maintainers for regression testing.
2. Implement an AMS-TeX validator
I'll probably use PLY because it's rumored to have helpful debugging features (designed for a first-year compilers class, apparently). ANTLR is another popular option, but this guy http://www.bearcave.com/software/antlr/antlr_expr.html thinks it's complicated and hard to debug. I've never used either, so if anyone on this list knows of a good Python parsing package I'd welcome suggestions.
3. Port over the existing tex->dvi->png rendering
This is probably just a few calls into the subprocess module. Yeah, I just jinxed it.
4. Add HTML rendering to texvc and test script
I don't even understand how the existing texvc decides whether HTML is good enough. It looks like the original programmer just decreed that certain LaTeX commands could be rendered to HTML, and defaults to PNG if it sees anything not on that list. How important is this feature?
5. Repackage the entire Math thing as an extension
I might do this if I have time left at the end. I'm sure the project will change over the summer.
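For step 3, a rough sketch of what those subprocess calls might look like (the command names, flags, and wrapper document here are illustrative guesses, not texvc's actual invocation):

```python
import subprocess
from pathlib import Path

def build_pipeline(basename="eq"):
    # Illustrative commands; the real texvc pipeline may differ.
    return [
        ["latex", "-interaction=nonstopmode", basename + ".tex"],
        ["dvipng", "-T", "tight", "-o", basename + ".png",
         basename + ".dvi"],
    ]

def render_png(tex_source, workdir, basename="eq"):
    """Wrap the <math> source in a document, then run latex and dvipng."""
    workdir = Path(workdir)
    (workdir / (basename + ".tex")).write_text(
        "\\documentclass{article}\\pagestyle{empty}\n"
        "\\begin{document}$" + tex_source + "$\\end{document}\n")
    for cmd in build_pipeline(basename):
        subprocess.run(cmd, cwd=str(workdir), check=True, timeout=30)
    return workdir / (basename + ".png")
```

The timeout is there because runaway TeX input is one obvious failure mode; the real implementation would need proper resource limits too.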
Python doesn't have parsing just locked right down the way C does with flex/bison, but there are some good options, I have the most experience with it, and I think I'd be able to complete the port faster in Python than in either of the other languages. I was tempted at first to port to PHP, to conform with the rest of Mediawiki, but there don't seem to be any good parsing packages for PHP. (Please tell me if that's wrong.)
I'd appreciate any advice or criticism. Since my only previous experience has been using Wikipedia and setting up a test Mediawiki instance for my ACM chapter, I'm only just now learning my way around the code base and it's not always evident why things were done as they are. Does this look like a reasonable and worthwhile project?
Yours, Damon Wang
P.S. Some of you may remember me on IRC a couple of days ago getting a little panicky about not knowing OCaml, but I'm a bit more hopeful now after looking around the source. I definitely have to keep the OCaml manual open for reference, but I've written Scheme, Common Lisp, and Haskell before, so I think I might be able to fake it. These are just Famous Last Words waiting to happen, I know.
On 03/23/2010 08:06 AM, Damon Wang wrote:
Hello everyone,
I'm interested in porting texvc to Python, and I was hoping this list here might help me hash out the plan. Please let me know if I should take my questions elsewhere.
Roughly, my plan of attack would be something like this:
- Collect test cases and write a testing script
Thanks to avar from #wikimedia, I already have the <math>...</math> bits from enwiki and dewiki. I would also construct some simpler ones by hand to test each of the acceptable LaTeX commands.
It is not too challenging to create a test file that checks most existing commands with some hacky regexes on the existing parser (I can't find where mine has gone, though); what is much harder is proving that your script cannot ever let through invalid or potentially harmful LaTeX (the moment anyone gets a \catcode past you, you're doomed, and there are many other commands that are "not wanted", to say the least). Obviously, the current implementation makes this reasonably pleasant to verify: the syntax of the parser is exceedingly light, so any reimplementation should strive to have as little syntactic overhead as possible.
- Implement an AMS-TeX validator
How different would this be from the current validator?
- Port over the existing tex->dvi->png rendering.
This is probably just a few calls into the subprocess module. Yeah, I just jinxed it.
- Add HTML rendering to texvc and test script
I don't even understand how the existing texvc decides whether HTML is good enough. It looks like the original programmer just decreed that certain LaTeX commands could be rendered to HTML, and defaults to PNG if it sees anything not on that list. How important is this feature?
I am not too fussed about the HTML output, though I can't speak for everyone. At the moment it seems that many more of the Unicode characters should be let through (at least at some level of HTML), though I don't know enough about worldwide Unicode support. Some things, like \sqrt, are pretty hard to render nicely in HTML, so images are still sensible for some expressions.
- Repackage the entire Math thing as an extension
I might do this if I have time left at the end. I'm sure the project will change over the summer.
This would be very amazing.
Python doesn't have parsing just locked right down the way C does with flex/bison, but there are some good options, I have the most experience with it, and I think I'd be able to complete the port faster in Python than in either of the other languages. I was tempted at first to port to PHP, to conform with the rest of Mediawiki, but there don't seem to be any good parsing packages for PHP. (Please tell me if that's wrong.)
A good PHP parser library would be exceptionally useful for MediaWiki (and many extensions); at the moment we have loads of methods that do regex "parsing", so if you felt like writing one... :D.
I'd appreciate any advice or criticism. Since my only previous experience has been using Wikipedia and setting up a test Mediawiki instance for my ACM chapter, I'm only just now learning my way around the code base and it's not always evident why things were done as they are. Does this look like a reasonable and worthwhile project?
Step 5 has been a "we really should do this" for a while; shipping OCaml code that many users won't be able to use is very messy. I am less convinced of the utility of a Python port: OCaml is a great language for implementing this, and I fear a lot of your time would be wasted trying to make the Python similarly nice. As you note, MediaWiki is not written in Python; doing this in PHP would be a larger step in the right direction, though without such nice frameworks, maybe less nice to do.
Instead of rewriting the <math> parser, it might be more productive to create parsers for some of the other languages that extensions use, hopefully with a view to adding additional extensions to Wikipedia. The ones I can think of immediately are <chem> tags (bug 3252/5856), <gnuplot>, <lilypond>/<ABC> (bug 189!), <graphviz> (bug 2403).
Yours Conrad
(PS. I'm no-one official, so can be ignored safely)
2010/3/23 Conrad Irwin conrad.irwin@googlemail.com:
Instead of rewriting the <math> parser, it might be more productive to create parsers for some of the other languages that extensions use, hopefully with a view to adding additional extensions to Wikipedia. The ones I can think of immediately are <chem> tags (bug 3252/5856), <gnuplot>, <lilypond>/<ABC> (bug 189!), <graphviz> (bug 2403).
Note that there's already an ABC extension, as linked on bug 189, which AFAIK is pretty much ready for WMF deployment. As mentioned on the same bug, shelling out to Lilypond has certain issues with unbounded time/CPU/memory usage. I'm not familiar with any of the other programs mentioned, so I can't comment on those.
Roan Kattouw (Catrope)
Hello Conrad,
- Implement an AMS-TeX validator
How different would this be from the current validator?
It should be exactly the same, except written in Python.
- Repackage the entire Math thing as an extension
I might do this if I have time left at the end. I'm sure the project will change over the summer.
This would be very amazing.
Maybe this should be my project, then.
Python doesn't have parsing just locked right down the way C does with flex/bison, but there are some good options, I have the most experience with it, and I think I'd be able to complete the port faster in Python than in either of the other languages. I was tempted at first to port to PHP, to conform with the rest of Mediawiki, but there don't seem to be any good parsing packages for PHP. (Please tell me if that's wrong.)
A good PHP parser library would be exceptionally useful for MediaWiki (and many extensions); at the moment we have loads of methods that do regex "parsing", so if you felt like writing one... :D.
Actually...
I've never used PHP for real programming, but how difficult would it be to write a really simple, stupid first pass at a DFA parser? I suspect I'd need much more than three months to make it useful, but would it be possible to implement some coherent subset of the features? E.g., building the LR0 automaton, at least?
I'd appreciate any advice or criticism. Since my only previous experience has been using Wikipedia and setting up a test Mediawiki instance for my ACM chapter, I'm only just now learning my way around the code base and it's not always evident why things were done as they are. Does this look like a reasonable and worthwhile project?
Step 5 has been a "we really should do this" for a while; shipping OCaml code that many users won't be able to use is very messy. I am less convinced of the utility of a Python port: OCaml is a great language for implementing this, and I fear a lot of your time would be wasted trying to make the Python similarly nice. As you note, MediaWiki is not written in Python; doing this in PHP would be a larger step in the right direction, though without such nice frameworks, maybe less nice to do.
I suggested a Python port because http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core lists it as a potential project idea. I was under the impression that people around here did not want to leave texvc in OCaml. Is this wrong?
Yours, Damon Wang
On Tue, Mar 23, 2010 at 4:06 AM, Damon Wang damonwang@uchicago.edu wrote:
I'm interested in porting texvc to Python, and I was hoping this list here might help me hash out the plan. Please let me know if I should take my questions elsewhere.
Python is much better than OCaml, and I prefer Python to PHP, but a PHP implementation would be preferable for core IMO. Not all MediaWiki developers know Python, but all obviously know PHP. If you did a Python implementation, though, then at least someone could translate it to PHP pretty easily.
- Collect test cases and write a testing script
Thanks to avar from #wikimedia, I already have the <math>...</math> bits from enwiki and dewiki. I would also construct some simpler ones by hand to test each of the acceptable LaTeX commands.
Would there be any possibility of logging the input seen by texvc on a production instance of Mediawiki, so I could get some invalid input submitted by actual users?
This could also be useful to future maintainers for regression testing.
If you have a Unix box handy, it's pretty easy to install MediaWiki with math support so you can test yourself. sudo apt-get install mediawiki mediawiki-math should do it on anything Debian-based, for example.
- Implement an AMS-TeX validator
I'll probably use PLY because it's rumored to have helpful debugging features (designed for a first-year compilers class, apparently). ANTLR is another popular option, but this guy http://www.bearcave.com/software/antlr/antlr_expr.html thinks it's complicated and hard to debug. I've never used either, so if anyone on this list knows of a good Python parsing package I'd welcome suggestions.
If it's in PHP, you'd probably have to write a parser yourself, but LaTeX is pretty easy to parse, I'd think.
- Add HTML rendering to texvc and test script
I don't even understand how the existing texvc decides whether HTML is good enough. It looks like the original programmer just decreed that certain LaTeX commands could be rendered to HTML, and defaults to PNG if it sees anything not on that list. How important is this feature?
Fairly important, IMO, if the goal is to replace texvc, although not critical. <math>x</math> shouldn't render x as a PNG -- that's silly.
Python doesn't have parsing just locked right down the way C does with flex/bison, but there are some good options, I have the most experience with it, and I think I'd be able to complete the port faster in Python than in either of the other languages. I was tempted at first to port to PHP, to conform with the rest of Mediawiki, but there don't seem to be any good parsing packages for PHP. (Please tell me if that's wrong.)
Would it really be very hard to write a LaTeX parser in PHP? I'd think it could be done easily, if you permit only a carefully-selected subset. I don't think you'd need any parser theory, just use preg_split() and loop through all the tokens.
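That split-and-loop approach can be sketched in a few lines (shown in Python rather than PHP for brevity; the whitelist here is a tiny illustrative stand-in for texvc's real command list):

```python
import re

# Tiny illustrative whitelist; texvc's real list has hundreds of entries.
ALLOWED = {r"\alpha", r"\frac", r"\sqrt", r"\sum", r"\begin", r"\end"}

# Capture control sequences, escaped single characters, and grouping
# or script characters; plain text falls between the captured tokens.
TOKEN = re.compile(r"(\\[a-zA-Z]+|\\.|[{}^_])")

def validate(tex):
    """Reject any control sequence that is not on the whitelist."""
    return all(not tok.startswith("\\") or tok in ALLOWED
               for tok in TOKEN.split(tex))
```

Here validate(r"\frac{1}{2}") passes, while anything containing \catcode or \write is rejected because those control sequences are not on the list.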
I'd appreciate any advice or criticism. Since my only previous experience has been using Wikipedia and setting up a test Mediawiki instance for my ACM chapter, I'm only just now learning my way around the code base and it's not always evident why things were done as they are. Does this look like a reasonable and worthwhile project?
Rewriting texvc in PHP would be a nice project to have, which is small enough in scope that I'm optimistic that it could be done in a summer. I'd say it's a good choice.
On Tue, Mar 23, 2010 at 6:23 AM, Conrad Irwin conrad.irwin@googlemail.com wrote:
I am not too fussed about the HTML output, though I can't speak for everyone. At the moment it seems that many more of the Unicode characters should be let through (at least at some level of HTML), though I don't know enough about worldwide Unicode support.
I suspect we need to be about as conservative as we currently are for platforms like IE6 on XP. We should be able to expand the range of HTML characters in the future, though.
A good PHP parser library would be exceptionally useful for MediaWiki (and many extensions); at the moment we have loads of methods that do regex "parsing", so if you felt like writing one... :D.
Wouldn't a real generic parser implementation written in PHP be too slow to be useful? preg_replace() has the advantage of being implemented in C.
I am less convinced of the utility of a Python port: OCaml is a great language for implementing this, and I fear a lot of your time would be wasted trying to make the Python similarly nice. As you note, MediaWiki is not written in Python; doing this in PHP would be a larger step in the right direction, though without such nice frameworks, maybe less nice to do.
OCaml might be a great language for implementing this, but very few of us understand it. texvc has been totally unmaintained for years, other than new things being added to the whitelist sometimes by means of cargo-culting what previous commits do. Rewriting texvc in *anything* that more people understand would be a step forward.
On Tue, Mar 23, 2010 at 8:31 AM, Roan Kattouw roan.kattouw@gmail.com wrote:
As mentioned on the same bug, shelling out to Lilypond has certain issues with unbounded time/CPU/memory usage.
The same is true for LaTeX. Lilypond would just need a parser and filter to whitelist safe constructs, like LaTeX does.
On Tue, Mar 23, 2010 at 12:25 PM, Damon Wang damonwang@uchicago.edu wrote:
I've never used PHP for real programming, but how difficult would it be to write a really simple, stupid first pass at a DFA parser? I suspect I'd need much more than three months to make it useful, but would it be possible to implement some coherent subset of the features? E.g., building the LR0 automaton, at least?
I don't think you'd need a "real" parser here. Mostly we just use preg_split() for this sort of thing. I'm not familiar with formal grammars and such, so I can't say what the concrete disadvantages of that approach are.
I suggested a Python port because http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core lists it as a potential project idea. I was under the impression that people around here did not want to leave texvc in OCaml. Is this wrong?
No, it's right. Conrad is crazy. :P
2010/3/23 Aryeh Gregor Simetrical+wikilist@gmail.com:
I've never used PHP for real programming, but how difficult would it be to write a really simple, stupid first pass at a DFA parser? I suspect I'd need much more than three months to make it useful, but would it be possible to implement some coherent subset of the features? E.g., building the LR0 automaton, at least?
I don't think you'd need a "real" parser here. Mostly we just use preg_split() for this sort of thing. I'm not familiar with formal grammars and such, so I can't say what the concrete disadvantages of that approach are.
DFAs parse regular languages, which means those languages can also be expressed as regexes. In fact, the regexes accepted by the preg_*() functions allow certain extensions to the language theory definition of regular expressions, allowing them to describe certain non-regular languages as well. In short: preg_split() can do everything a DFA can do, and more. The only reason to use a DFA parser would be performance, but since the preg_*() functions are so heavily optimized I don't think that'll be an issue.
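The point that the preg extensions go beyond regular languages is easy to demonstrate with a backreference (Python's re module shown here; it supports the same extension as PCRE):

```python
import re

# The language a^n b a^n is not regular (by the pumping lemma), yet a
# single backreference recognizes it: \1 must repeat group 1 exactly.
NON_REGULAR = re.compile(r"(a+)b\1")

def matches(s):
    return NON_REGULAR.fullmatch(s) is not None
```

So matches("aaabaaa") succeeds while matches("aabaaa") fails, a distinction no finite automaton can make for arbitrary n.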
I suggested a Python port because http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core lists it as a potential project idea. I was under the impression that people around here did not want to leave texvc in OCaml. Is this wrong?
No, it's right. Conrad is crazy. :P
Having it in a language no one understands is a bad thing and leads to maintenance not happening, so yeah, we definitely want it rewritten in PHP. If the PHP implementation turns out to be too slow to run on WMF, for instance, we could do a C++ port à la wikidiff2 (a C++ port of our ludicrously slow PHP diff implementation).
Roan Kattouw (Catrope)
On 03/23/2010 05:00 PM, Roan Kattouw wrote:
I suggested a Python port because http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core lists it as a potential project idea. I was under the impression that people around here did not want to leave texvc in OCaml. Is this wrong?
No, it's right. Conrad is crazy. :P
Having it in a language no one understands is a bad thing and leads to maintenance not happening, so yeah, we definitely want it rewritten in PHP. If the PHP implementation turns out to be too slow to run on WMF, for instance, we could do a C++ port à la wikidiff2 (a C++ port of our ludicrously slow PHP diff implementation).
And here was me thinking that maintenance didn't happen because making changes to security-critical sections of the code is dangerous :). The current implementation is just over a thousand lines of exceedingly concise code; while I agree that a re-implementation in PHP is probably sensible, I'll stubbornly maintain that the existing OCaml is more suited to the task. (Oh, and it seems I misread that proposal; I could not imagine a language other than LaTeX being useful for doing maths :p).
While re-implementing the syntax whitelister would not be too hard, LaTeX, with its wonderfully redefinable syntax, is incredibly dangerous. Have fun, and be careful!
Conrad
http://tug.ctan.org/cgi-bin/ctanPackageInformation.py?id=xii
On Tue, Mar 23, 2010 at 1:00 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
DFAs parse regular languages, which means those languages can also be expressed as regexes. In fact, the regexes accepted by the preg_*() functions allow certain extensions to the language theory definition of regular expressions, allowing them to describe certain non-regular languages as well. In short: preg_split() can do everything a DFA can do, and more. The only reason to use a DFA parser would be performance, but since the preg_*() functions are so heavily optimized I don't think that'll be an issue.
This much I know, but is LaTeX actually a regular language?
On Tue, Mar 23, 2010 at 1:13 PM, Conrad Irwin conrad.irwin@googlemail.com wrote:
And here was me thinking that maintenance didn't happen because making changes to security critical sections of the code is dangerous :).
It's not security-critical. The worst you could possibly do is DoS, and any DoS could be instantly shut off by just turning off math briefly. Furthermore, the part that makes DoS impossible is a quite small portion of the code that would need to change effectively never. No, the problem is that most PHP programmers have never even heard of OCaml, let alone used it.
2010/3/23 Aryeh Gregor Simetrical+wikilist@gmail.com:
This much I know, but is LaTeX actually a regular language?
I don't know; I was just making the point that writing a DFA parser in PHP is probably not very useful.
Roan Kattouw (Catrope)
2010/3/23 Roan Kattouw roan.kattouw@gmail.com:
2010/3/23 Aryeh Gregor Simetrical+wikilist@gmail.com:
This much I know, but is LaTeX actually a regular language?
I don't know; I was just making the point that writing a DFA parser in PHP is probably not very useful.
Sorry, I got confused and wrote DFA when I should have written LALR. DFAs cannot parse even the allowed subset of AMS-LaTeX, because there are some permitted environments.
Without claiming to know much formal language theory, a rule of thumb is that languages with arbitrarily nested matched delimiters are never regular, because of the pumping lemma: http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages
So, for example, it's theoretically impossible to check that parentheses nest correctly using regular expressions, and similarly it'd be impossible to check that the \begin and \end commands match up.
In practice there might be ways to hack around that by using multiple regular expressions and manually tracking how they nest, but at that point we're basically writing half of a bad LALR parser.
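The environment-matching half of that hypothetical parser really is only a few lines once you allow yourself an explicit stack; a sketch (error reporting and texvc's environment whitelist omitted):

```python
import re

ENV = re.compile(r"\\(begin|end)\{([A-Za-z*]+)\}")

def environments_balanced(tex):
    r"""Check that \begin{...}/\end{...} pairs nest correctly; the
    stack holds exactly the state a plain regex cannot keep."""
    stack = []
    for kind, name in ENV.findall(tex):
        if kind == "begin":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack
```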
Fortunately, though, Python has parser generators! And if we're really concerned about speed, there's PyBison, which does the parsing in C and apparently produces (at least) five-fold improvements over Python-native alternatives.
Yours, Damon Wang
On 03/23/2010 05:23 PM, Aryeh Gregor wrote:
On Tue, Mar 23, 2010 at 1:00 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
DFAs parse regular languages, which means those languages can also be expressed as regexes. In fact, the regexes accepted by the preg_*() functions allow certain extensions to the language theory definition of regular expressions, allowing them to describe certain non-regular languages as well. In short: preg_split() can do everything a DFA can do, and more. The only reason to use a DFA parser would be performance, but since the preg_*() functions are so heavily optimized I don't think that'll be an issue.
This much I know, but is LaTeX actually a regular language?
It's not even context-free; luckily, the subset we are interested in is (as clearly shown by the texvc parser :p).
On Tue, Mar 23, 2010 at 1:13 PM, Conrad Irwin conrad.irwin@googlemail.com wrote:
And here was me thinking that maintenance didn't happen because making changes to security critical sections of the code is dangerous :).
It's not security-critical. The worst you could possibly do is DoS, and any DoS could be instantly shut off by just turning off math briefly. Furthermore, the part that makes DoS impossible is a quite small portion of the code that would need to change effectively never. No, the problem is that most PHP programmers have never even heard of OCaml, let alone used it.
Many LaTeX installations can be made to read, write, or execute anything by default. LaTeX also allows you to redefine the meaning of characters in the input; if you accidentally let a single such command through, then all the whitelisting becomes pointless. It certainly is a security issue.
Conrad
Conrad Irwin wrote:
On 03/23/2010 05:23 PM, Aryeh Gregor wrote:
On Tue, Mar 23, 2010 at 1:00 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
DFAs parse regular languages, which means those languages can also be expressed as regexes. In fact, the regexes accepted by the preg_*() functions allow certain extensions to the language theory definition of regular expressions, allowing them to describe certain non-regular languages as well. In short: preg_split() can do everything a DFA can do, and more. The only reason to use a DFA parser would be performance, but since the preg_*() functions are so heavily optimized I don't think that'll be an issue.
This much I know, but is LaTeX actually a regular language?
It's not even context-free; luckily, the subset we are interested in is (as clearly shown by the texvc parser :p).
Just because a language is context-sensitive doesn't mean it will be hard to write a parser for it. That's just a myth propagated by computer scientists who, strangely enough given their profession, have a disdain for the algorithm as a descriptive framework.
In the last few decades, pure mathematicians have been exploring the power of algorithms as a general description of an axiomatic system. And simultaneously, computer scientists have embraced the idea that the best way to process text is by trying to shoehorn all computer languages into some Chomsky-inspired representation, regardless of how awkward that representation is, or how inefficient the resulting algorithm becomes, when compared to an algorithm constructed a priori.
-- Tim Starling
I think we should really consider LOLCODE for this sort of thing.
http://en.wikipedia.org/wiki/Lolcode
It's just more fun!
- Trevor
On 3/23/10 3:44 PM, Tim Starling wrote:
Conrad Irwin wrote:
On 03/23/2010 05:23 PM, Aryeh Gregor wrote:
On Tue, Mar 23, 2010 at 1:00 PM, Roan Kattouwroan.kattouw@gmail.com wrote:
DFAs parse regular languages, which means those languages can also be expressed as regexes. In fact, the regexes accepted by the preg_*() functions allow certain extensions to the language theory definition of regular expressions, allowing them to describe certain non-regular languages as well. In short: preg_split() can do everything a DFA can do, and more. The only reason to use a DFA parser would be performance, but since the preg_*() functions are so heavily optimized I don't think that'll be an issue.
This much I know, but is LaTeX actually a regular language?
It's not even context-free; luckily, the subset we are interested in is (as clearly shown by the texvc parser :p).
Just because a language is context-sensitive doesn't mean it will be hard to write a parser for it. That's just a myth propagated by computer scientists who, strangely enough given their profession, have a disdain for the algorithm as a descriptive framework.
In the last few decades, pure mathematicians have been exploring the power of algorithms as a general description of an axiomatic system. And simultaneously, computer scientists have embraced the idea that the best way to process text is by trying to shoehorn all computer languages into some Chomsky-inspired representation, regardless of how awkward that representation is, or how inefficient the resulting algorithm becomes, when compared to an algorithm constructed a priori.
-- Tim Starling
On Wed, Mar 24, 2010 at 9:16 AM, Trevor Parscal tparscal@wikimedia.org wrote:
I think we should really consider LOLCODE for this sort of thing.
http://en.wikipedia.org/wiki/Lolcode
It's just more fun!
- Trevor
Also rewrite parser functions to use it? That would be interesting on en.wiki, since they are always complaining about the syntax.
jks
-Peachey
On 23 March 2010 23:19, K. Peachey p858snake@yahoo.com.au wrote:
On Wed, Mar 24, 2010 at 9:16 AM, Trevor Parscal tparscal@wikimedia.org wrote:
I think we should really consider LOLCODE for this sort of thing. http://en.wikipedia.org/wiki/Lolcode It's just more fun!
Also rewrite parser functions to use it? that would be interesting on en.wiki since they are always complaining about the syntax.
I can think of little more appropriate.
- d.
On 03/23/2010 10:44 PM, Tim Starling wrote:
Just because a language is context-sensitive doesn't mean it will be hard to write a parser for it. That's just a myth propagated by computer scientists who, strangely enough given their profession, have a disdain for the algorithm as a descriptive framework.
Context-free grammars have a strong advantage because they come with documentation built in. Given a BNF-esque description of a language, it is possible to understand, at a high level, how the language works. This means it's easy to write a parser (though much easier to get the parser written automatically); it's also more pleasant to verify properties of the language (no state to keep track of), and to reason about the consequences of modifications. Of course it is possible to give decent documentation for a context-sensitive language; from what I've seen, this just doesn't happen.
Take, for example, Python and Perl, or Markdown and MediaWiki. In both cases the former has a syntax that can be mostly modeled by a context-free grammar, and there are many implementations that all work. The latter of each pair has a context-sensitive grammar defined only by the reference implementation; there are no other feature-complete parsers. This is certainly not a technical limitation, but rather a reflection on the ability of your average human programmer.
Conrad
My impression is that this is a problem where a PHP implementation could do fine. Who cares if it is slow? The result can be cached forever, it is something you will run only once, and the heavyweight work (drawing) will be done by C-compiled code like the GD library.
You need speed in stuff that runs inside loops (runs N times), or stuff that delays other stuff (can't be made async), or stuff that is CPU-intensive and runs every time (the computer gets angry), or stuff that is very IO-intensive (mechanical stuff is slow), or stuff that needs gigawatts of memory (the memory gets angry if you touch lots of pages).
Stuff that is parallelizable, async, memory-light, and not IO-intensive doesn't need any optimization at all. Write the better code and have a big smiley :-)
On Tue, Mar 23, 2010 at 6:28 PM, Conrad Irwin conrad.irwin@googlemail.com wrote:
Many LaTeX installations can be made to read, write, or execute anything by default.
What does that mean? LaTeX can invoke external programs? Using what commands? Is this functionality actually enabled in practice in stock LaTeX installs?
On 03/24/2010 02:00 PM, Aryeh Gregor wrote:
On Tue, Mar 23, 2010 at 6:28 PM, Conrad Irwin conrad.irwin@googlemail.com wrote:
Many LaTeX installations can be made to read, write, or execute anything by default.
What does that mean? LaTeX can invoke external programs? Using what commands? Is this functionality actually enabled in practice in stock LaTeX installs?
Yes: \openout, \write, \closeout, \openin, \read, \closein. The infamous one is \write18; stream 18 is special and just executes shell commands, and you can also use \openin={|<shell command>}.
People have noticed this problem, so some distributions disable \write18 (and opening with |), and also configure it such that files can only be read and written within the current directory or its subdirectories. This is, to my knowledge, not bypassable.
There's also a (more recent) mode called "restricted write18", used by more user-friendly distributions; this allows some commands (such as bibtex or tex) to be used with \write18 and |. Sadly, as arguments passed to the commands are not validated (though they are shell-escaped), it is possible to break out of the sandbox.
Even if it were not possible to break out of restricted write18, there will exist installations with write18 enabled, and I can't imagine people remembering to check. Depending on the flavour of LaTeX in use, it is often possible to pass -no-shell-escape or --disable-write18 on the command line; we should probably do that, though I'm not sure whether unknown flags cause errors.
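If we do add such a flag, the shell-out might look something like this (which flag a given installation accepts is exactly the uncertainty above, so treat both names as assumptions to be probed or configured per installation):

```python
import subprocess

def safe_latex_command(texfile, flag="-no-shell-escape"):
    """Build a latex invocation with shell escapes disabled.

    TeX Live understands -no-shell-escape; some other distributions
    have used --disable-write18 instead, so the flag is a parameter.
    """
    return ["latex", flag, "-interaction=nonstopmode", texfile]

def render(texfile, workdir):
    # Passing an argument list (not a shell string) also avoids shell
    # injection via the filename itself.
    subprocess.run(safe_latex_command(texfile), cwd=workdir,
                   check=True, timeout=30)
```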
Conrad
Even if it were not possible to break out of restricted write18, there will exist installations with write18 enabled, and I can't imagine people remembering to check.
If people won't remember, surely either the MediaWiki installer or the extension itself can be made to check this and simply refuse to enable <math> tags.
On Wed, Mar 24, 2010 at 10:43 AM, Conrad Irwin conrad.irwin@googlemail.com wrote:
Yes: \openout, \write, \closeout, \openin, \read, \closein. The infamous one is \write18: stream 18 is special-cased to pass its argument to the shell. You can also use \openin={|<shell command>}.
People have noticed this problem, so some distributions disable \write18 (and opening with |), and also configure it such that files can only be read and written within the current directory or subdirectories. This is, to my knowledge, not by-passable.
As long as the worst that could happen on a large majority of installations is DoS, I don't think we should be afraid to rewrite the code just because *maybe* it would be less secure. We should obviously check over the new code carefully, but I wouldn't say it's any more security-critical than random pieces of MediaWiki -- which are typically vulnerable to XSS if someone forgets to escape something.
Aryeh Gregor wrote:
As long as the worst that could happen on a large majority of installations is DoS, I don't think we should be afraid to rewrite the code just because *maybe* it would be less secure. We should obviously check over the new code carefully, but I wouldn't say it's any more security-critical than random pieces of MediaWiki -- which are typically vulnerable to XSS if someone forgets to escape something.
Getting shell access is not a DoS or XSS. Specially for a large majority of installs where it compromises their only account. Does this mean that we shouldn't rewrite it? No. We should rewrite it, and make it more secure. We start it by having enough eyes on the code. I wouldn't be surprised if we found a vulnerability on texvc during the rewrite.
Running the LaTeX interpreter under ulimit -u 1 should provide a fairly safe default against external launches. But take into account that file writes are also a dangerous vector.
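For the Python port, those limits could be applied between fork and exec via subprocess's preexec_fn. A POSIX-only sketch (the specific limit values here are illustrative guesses, not tested recommendations):

```python
import resource
import subprocess
import sys

def no_subprocess_limits():
    # Runs in the child between fork() and exec(). RLIMIT_NPROC (1, 1)
    # mirrors "ulimit -u 1": the exec'd LaTeX keeps running, but any
    # fork() it attempts fails. RLIMIT_CPU caps runaway computations.
    # Lowering limits never requires privileges.
    resource.setrlimit(resource.RLIMIT_NPROC, (1, 1))
    resource.setrlimit(resource.RLIMIT_CPU, (10, 10))

def run_sandboxed(argv):
    # preexec_fn is POSIX-only and unsafe in threaded programs, which
    # is acceptable for a single-purpose render helper.
    return subprocess.run(argv, preexec_fn=no_subprocess_limits)
```

As Platonides notes, this only blocks process launches and CPU hogging; the file-write vector still needs the texmf.cnf openout restrictions on top.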
Hi Damon,
Thank you so much for floating your GSoC ideas early here on the mailing list! Putting out concrete examples we can weigh in on is really helpful, and engaging in this way is a fantastic way of demonstrating how you'll be able to engage with us if we select your project.
On Tue, Mar 23, 2010 at 1:06 AM, Damon Wang damonwang@uchicago.edu wrote:
I'm interested in porting texvc to Python, and I was hoping this list here might help me hash out the plan.
As I'm sure you've already gathered from the other responses, this is exactly the right place. I'm a little skeptical myself that porting that particular piece of code from OCaml to Python is going to be a really big win for us (because it's still a "foreign" language as far as PHP-based MediaWiki is concerned, so integration is still a little clunky and performance may take a hit due to yet another interpreter needing to load), but I'll let others weigh in on whether I'm making too big a deal about that.
Stepping back from the specifics of your proposal (which I think the others on this list have responded to pretty well), I'd like to find out more about what general sorts of projects interest you the most, which may help us figure out if we should keep going in this direction. Some questions:
1. Are you most interested in having a Python-based project, or would you be *equally* happy and productive programming something in PHP?
2. Are you zeroing in on <math> parsing and parsing in general because that's an area that you're already developing expertise in and/or are deeply interested in getting into, or is that just something that looked kinda interesting to learn about relative to other opportunities you considered?
3. Are you coming at this as someone who is already deep into Wikipedia/MediaWiki usage who is looking to resolve particular things (like <math> parsing) that are painful as an end user, or are you more casually involved and more interested in applying in this project because it looks like we've got a lot of interesting programming problems to solve?
Just to be really clear, I'm not looking for a "right" answer on any of those questions. It's not necessary for you to be even interested in getting deeply involved in the Wikipedia user community to have a really successful project. The purpose of this line of questions is to figure out if we should continue helping you refine your current idea, or suggest some other direction that's a bigger payoff and/or easier sell.
Rob
Hello Rob,
Just to be really clear, I'm not looking for a "right" answer on any of those questions. It's not necessary for you to be even interested in getting deeply involved in the Wikipedia user community to have a really successful project. The purpose of this line of questions is to figure out if we should continue helping you refine your current idea, or suggest some other direction that's a bigger payoff and/or easier sell.
I understand, and that'd be very helpful. To be honest, I'm not passionately committed to any project at all. I've been writing projects for university and for a computer lab I work at, but it's mostly small, one-off sysadmin things and usually the emphasis is more on "xyz server has to be back up before we open tomorrow" than writing good, clean code. So, yes, I'd welcome other suggestions.
As I'm sure you've already gathered from the other responses, this is exactly the right place. I'm a little skeptical myself that porting that particular piece of code from OCaml to Python is going to be a really big win for us (because it's still a "foreign" language as far as PHP-based MediaWiki is concerned, so integration is still a little clunky and performance may take a hit due to yet another interpreter needing to load), but I'll let others weigh in on whether I'm making too big a deal about that.
There are ways to make this run faster if performance is a concern. For example, mod_python or mod_wsgi, or explicitly pulling the Python out into a standalone daemon that listens for requests from the webserver.
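A standalone daemon along those lines could be quite small. A hypothetical sketch with an invented one-line-in, one-line-out protocol (no actual rendering, just the plumbing):

```python
import socket
import socketserver
import threading

class RenderHandler(socketserver.StreamRequestHandler):
    """Hypothetical protocol: one TeX fragment per line in, one status
    line out. A real daemon would validate and render here."""
    def handle(self):
        tex = self.rfile.readline().strip().decode("utf-8")
        # Placeholder "validation": accept anything non-empty.
        status = "OK" if tex else "ERR empty input"
        self.wfile.write((status + "\n").encode("utf-8"))

def start_daemon(port=0):
    # port=0 asks the OS for a free port; the caller reads the real
    # address back from server.server_address.
    server = socketserver.TCPServer(("127.0.0.1", port), RenderHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The PHP side would then be a few lines of fsockopen/fgets, and the Python interpreter stays resident between requests instead of being re-launched per <math> tag.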
Another possibility would be writing it in C to avoid all interpreter overhead, and using a foreign function interface. Unfortunately, I'm not familiar with PHP's FFI. Google takes me to http://wiki.php.net/rfc/php_native_interface which seems to think that as of a year ago there weren't any good ones, but this doesn't look too painful: http://theserverpages.com/php/manual/en/zend.creating.php
Stepping back from the specifics of your proposal (which I think the others on this list have responded to pretty well), I'd like to find out more about what general sorts of projects interest you the most, which may help us figure out if we should keep going in this direction. Some questions:
- Are you most interested in having a Python-based project, or would you
be *equally* happy and productive programming something in PHP?
I'm most familiar with Python and C, for whatever that's worth coming from an undergrad who didn't know Python existed five years ago. I learned PHP to maintain the web interfaces of an in-house print system at work, but I haven't used it for anything as involved as what we're discussing here. So, in terms of productivity, yes, if I have to work in PHP my mentor will probably get asked a few more newbie questions.
In terms of happiness, though, it'd be a great opportunity to dig into PHP and finally learn to use it as more than really smart CSS with a database connection. Although I prefer Python or even C because I think I'd be more useful, I wouldn't be very upset at all if it turned out you guys were willing to let me learn PHP on your time.
- Are you zeroing in on <math> parsing and parsing in general because
that's an area that you're already developing expertise in and/or are deeply interested in getting into, or is that just something that looked kinda interesting to learn about relative to other opportunities you considered?
I like the <math> parsing project because it seems well-suited for a third-year undergrad who knows LaTeX and reads a few other functional languages and has studied lex/yacc before in his coursework. The goals are clear, and I know how to break them down into smaller problems and how to tackle each one. It's a little isolated from the rest of Mediawiki, so I don't need to grok the entire code base.
Basically, this looks like a way to make a concrete contribution despite being a newcomer to the project. That doesn't mean I'm not happy to entertain alternatives, just that they have a pretty high bar to clear.
- Are you coming at this as someone who is already deep into
Wikipedia/MediaWiki usage who is looking to resolve particular things (like <math> parsing) that are painful as an end user, or are you more casually involved and more interested in applying in this project because it looks like we've got a lot of interesting programming problems to solve?
The second. I just want to tackle a problem that's near but not quite beyond my limits, and if I can help out a site I use daily, so much the better.
Yours, Damon Wang
On Tue, Mar 23, 2010 at 2:00 PM, Damon Wang damonwang@uchicago.edu wrote:
I've been writing projects for university and for a computer lab I work at, but it's mostly small, one-off sysadmin things and usually the emphasis is more on "xyz server has to be back up before we open tomorrow" than writing good, clean code. So, yes, I'd welcome other suggestions.
Cool! So, I'm assuming you're looking forward to an opportunity to write good, clean code as a summer project. :)
There are ways to make [Python-based extensions] run faster if performance
is a concern. For example, mod_python or mod_wsgi, or explicitly pulling the Python out into a standalone daemon that listens for requests from the webserver.
Personally, I'd avoid trying to make that pitch for a GSoC project. While you're right that Python is a pretty defensible choice when embarking on a large project, trading one dependency for another for this size/scale of project won't be as compelling as eliminating a dependency altogether.
Of course, as I say that, I see Platonides disagrees with me here. Choosing Python is not a huge disadvantage in this context, but it's not going to have the same unanimous(-ish) approval of using PHP.
Another possibility would be writing it in C to avoid all interpreter overhead, and using a foreign function interface. Unfortunately, I'm not familiar with PHP's FFI. Google takes me to http://wiki.php.net/rfc/php_native_interface which seems to think that as of a year ago there weren't any good ones, but this doesn't look too painful: http://theserverpages.com/php/manual/en/zend.creating.php
I think straight PHP would be fine for this particular project. The downside of a C implementation is that, while it's almost certainly going to have the best performance characteristics, it also makes it more likely to fall into disrepair and be a possible source of buffer overruns and other security issues.
The nice thing about a PHP port (if done correctly) is that it would be a trivial install for small wikis and Wikipedia alike. That translates into more usage, which in turn translates into higher likelihood that it stays maintained.
That said, there have got to be a ton of projects that could benefit from PHP->native C bindings. I'm going to leave it to some other folks to suggest projects in this area.
I'm most familiar with Python and C, for whatever that's worth coming from an undergrad who didn't know Python existed five years ago. I learned PHP to maintain the web interfaces of an in-house print system at work, but I haven't used it for anything as involved as what we're discussing here. So, in terms of productivity, yes, if I have to work in PHP my mentor will probably get asked a few more newbie questions.
In terms of happiness, though, it'd be a great opportunity to dig into PHP and finally learn to use it as more than really smart CSS with a database connection. Although I prefer Python or even C because I think I'd be more useful, I wouldn't be very upset at all if it turned out you guys were willing to let me learn PHP on your time.
There's a few Python-based things that might be interesting, but I think you'll get a lot more love for doing something in PHP or C. Since this is a student internship, you shouldn't be bashful about using this as a learning opportunity.
I'd only caution against convincing yourself (and us) that you'll be more interested in learning something like PHP than you truly are. It might help you land a spot, but it will work against you in having a successful project, and this has such high visibility that you'll really want to be successful. So, if you find yourself thinking about doing this in PHP and having your inner voice say "meh", then I'd recommend sticking to your guns and propose doing this or something else in Python and/or C.
- Are you zeroing in on <math> parsing and parsing in general because
that's an area that you're already developing expertise in and/or are
deeply
interested in getting into, or is that just something that looked kinda interesting to learn about relative to other opportunities you
considered?
I like the <math> parsing project because it seems well-suited for a third-year undergrad who knows LaTeX and reads a few other functional languages and has studied lex/yacc before in his coursework. The goals are clear, and I know how to break them down into smaller problems and how to tackle each one. It's a little isolated from the rest of Mediawiki, so I don't need to grok the entire code base.
Basically, this looks like a way to make a concrete contribution despite being a newcomer to the project. That doesn't mean I'm not happy to entertain alternatives, just that they have a pretty high bar to clear.
This is a really smart way of thinking about this, so that's great that you're thinking the right way about the project scope. I agree with you that finding something reasonably well-contained is going to be the best strategy for success.
- Are you coming at this as someone who is already deep into
Wikipedia/MediaWiki usage who is looking to resolve particular things
(like
<math> parsing) that are painful as an end user, or are you more casually involved and more interested in applying in this project because it looks like we've got a lot of interesting programming problems to solve?
The second. I just want to tackle a problem that's near but not quite beyond my limits, and if I can help out a site I use daily, so much the better.
Wonderful! Great reason to get involved!
Rob
There's a few Python-based things that might be interesting, but I think you'll get a lot more love for doing something in PHP or C. Since this is a student internship, you shouldn't be bashful about using this as a learning opportunity.
I'd only caution against convincing yourself (and us) that you'll be more interested in learning something like PHP than you truly are. It might help you land a spot, but it will work against you in having a successful project, and
this has such high visibility that you'll really want to be successful.
What visibility does this have? I thought it was some abandoned corner of the wiki that nobody has touched in the seven years since it was first written. What happens if I make a hash of this?
So, if you find yourself thinking about doing this in PHP and having your inner voice say "meh", then I'd recommend sticking to your guns and propose doing this or something else in Python and/or C.
Well, now my inner voice says, "I really don't want to make a hash of this texvc port!", so let me explain why I want to do it in Python rather than PHP. I agree that the performance will probably be just fine, and that it would be a great coup for maintainability and installation and usage. The problem is, I don't think PHP has a parser-generator package.
So let me make sure I understand the problem here. You already have a texvc implementation that has worked just fine for the last seven years. TeX is pretty stable at this point, so chances are good you'd make it another seven years without problems. But you're still dissatisfied because OCaml is a hard language to find programmers for, and the existing implementation isn't really maintained. You want it ported to a different language that has more programmers available.
(You also want it repackaged as a Mediawiki extension rather than a core feature; I'm going to do that, but I won't say anything more because it seems fairly uncontroversial.)
Since the subset of TeX you need parsed has a context-free grammar, it needs an LALR parser, not just a bunch of regexes. I know three ways to get an LALR parser:
(1) write a pushdown automaton manually (i.e., be yacc)
(2) write input for a parser-generator
(3) write a parser-generator, and give it input
Option (2) is the most maintainable and feasible option, and it's precisely the one that cannot be done in PHP. As far as I know, PHP has no parser-generator package. (Please, please let me know if that's incorrect so I can stop embarrassing myself and get on with writing a GSoC proposal.)
I could probably do (1), or some hackish kludge at half of it, by throwing custom control structures into a bucketload of regexes, but I don't think that's in the project's best interests. As has been pointed out, the OCaml implementation is really concise and elegant. A large fraction of that concision and elegance comes from not actually being a parser but rather only a context-free grammar written in a BNF-like syntax common to most parser-generators.
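To make concrete what option (1) entails, here's a toy recursive-descent sketch for a brace-grouped LaTeX subset (the tokenizer and structure are my own invention, not texvc's). Note how much of it is hand-maintained brace bookkeeping that a parser-generator would get from a single grammar line:

```python
import re

# Tokens: \commands, braces, or single non-space characters.
TOKEN = re.compile(r"\\[A-Za-z]+|[{}]|[^\s{}\\]")

def parse(src):
    """Hand-written parser for brace-grouped LaTeX fragments.
    Returns a nested list; raises ValueError on mismatched braces."""
    tokens = TOKEN.findall(src)
    pos = 0

    def group():
        nonlocal pos
        out = []
        while pos < len(tokens) and tokens[pos] != "}":
            tok = tokens[pos]
            pos += 1
            if tok == "{":
                out.append(group())
                if pos >= len(tokens) or tokens[pos] != "}":
                    raise ValueError("unclosed brace")
                pos += 1  # consume the matching "}"
            else:
                out.append(tok)
        return out

    tree = group()
    if pos != len(tokens):
        raise ValueError("unbalanced braces")
    return tree
```

Even this ignores command arity, sub/superscripts, and the whitelist; scaling it to texvc's full grammar by hand is exactly the maintenance burden I'd like to avoid.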
I think it'd be easier to find a programmer who has worked with a parser-generator and can learn a little bit of OCaml, than it would be to find a PHP programmer who has to read himself into a manually implemented parser. After all, how many PHP programmers do you know who have experience mucking around inside an LALR parser?
So that's why, while I'm happy to take it on in PHP as a learning experience for myself, I think it'd be better for Mediawiki to port texvc to Python. That gets us the larger pool of potential maintainers that comes with using a commonly known language, without sacrificing the amazing advantage of only needing to maintain a grammar rather than the parser itself.
And as far as dependencies are concerned, Python is still a much easier dependency to satisfy, both for programmers working with the code and for sysadmins installing it.
What do you guys think?
Also, would anyone be interested in mentoring this project?
Yours, Damon Wang
On Fri, Mar 26, 2010 at 7:48 PM, Damon Wang damonwang@uchicago.edu wrote:
There's a few Python-based things that might be interesting, but I think you'll get a lot more love for doing something in PHP or C. Since this is a student internship, you shouldn't be bashful about using this as a learning opportunity.
I'd only caution against convincing yourself (and us) that you'll be more interested in learning something like PHP than you truly are. It might help you land a spot, but it will work against you in having a successful project, and
this has such high visibility that you'll really want to be successful.
What visibility does this have? I thought it was some abandoned corner of the wiki that nobody has touched in the seven years since it was first written. What happens if I make a hash of this?
Hi Damon,
Oops....that was a little ambiguous and probably applies a little more pressure than intended. What I meant to say is that Google Summer of Code generally is pretty high visibility, not this project in particular. Projects often go back and review results from previous years (just like we did: http://www.mediawiki.org/wiki/Summer_of_Code_Past_Projects ). There's plenty of ways to have a noble failure that won't reflect poorly on you, but that's probably not what you should aim for. There's nothing particularly high profile about this particular project relative to other GSoC stuff.
Anyway, in response to the specifics about Python/texvc. I was looking around for some ideas about how to approach replacing texvc with a Python implementation, and stumbled into this: http://www.mediawiki.org/wiki/Texvc_PHP_Alternative
That implementation seems to punt on the whole parsing thing, and as near as I can tell from a cursory reading, just passes it all through to latex, so that probably won't do. However, there may be something I'm missing.
Interestingly enough, though, looking at the Talk page for that leads you here: http://sourceforge.net/projects/latex2mathml/
This *does* have a parser. As you might expect, the code looks pretty involved, and seems to be handling parsing 101 without the benefit of anything other than the trusty substr and strpos functions. There's enough code there doing enough character-by-character manipulation that it makes me fear for the performance. Still, it looks like some serious work has actually been done, so it bears some level of investigation.
Anyway, I hear what you're saying about Python's much better parsing support (it wasn't too long ago I was gushing about the simpleparse module on my blog[1]). Given the number of other external dependencies that would probably still remain even with a PHP implementation, it's probably not worth sweating the additional Python dependency in the grand scheme of things. Python seems like a much less daunting dependency than OCaml, but I know far too little about OCaml to actually assert that with any confidence.
Regardless of which path you choose, I'd be happy to be your mentor assuming we have enough slots for this project.
On Fri, Mar 26, 2010 at 10:48 PM, Damon Wang damonwang@uchicago.edu wrote:
(You also want it repackaged as a Mediawiki extension rather than a core feature; I'm going to do that, but I won't say anything more because it seems fairly uncontroversial.)
I actually disagree with this pretty strongly. It would be a regression in functionality for existing users -- if they upgrade, their wiki breaks unless they install a new extension. There's no reason to remove it from core that I see that outweighs this disadvantage.
Since the subset of TeX you need parsed has a context-free grammar, it needs an LALR parser, not just a bunch of regexes. I know three ways to get an LALR parser:
(1) write a pushdown automaton manually (i.e., be yacc)
(2) write input for a parser-generator
(3) write a parser-generator, and give it input
Option (2) is the most maintainable and feasible option, and it's precisely the one that cannot be done in PHP. As far as I know, PHP has no parser-generator package. (Please, please let me know if that's incorrect so I can stop embarrassing myself and get on with writing a GSoC proposal.)
I could probably do (1), or some hackish kludge at half of it, by throwing custom control structures into a bucketload of regexes, but I don't think that's in the project's best interests. As has been pointed out, the OCaml implementation is really concise and elegant. A large fraction of that concision and elegance comes from not actually being a parser but rather only a context-free grammar written in a BNF-like syntax common to most parser-generators.
Okay, well, maybe you're right. I'd be interested to hear Tim Starling's opinion on this (using parser generators vs. writing by hand). Writing it in Python would certainly be a big step forward from OCaml -- any site with LaTeX accessible to MediaWiki will almost certainly have Python available, so Python vs. PHP should make no difference to end-users. And Python is probably the second-best-known language among MediaWiki hackers.
I think it'd be easier to find a programmer who has worked with a parser-generator and can learn a little bit of OCaml, than it would be to find a PHP programmer who has to read himself into a manually implemented parser. After all, how many PHP programmers do you know who have experience mucking around inside an LALR parser?
The parsing part is unlikely to need much maintenance. There are other things currently in OCaml that make more sense to modify from time to time -- like the whitelist of commands, and (some of?) the code for non-image output formats. So for instance, MathML output is theoretically supported, but I don't know how good the support is. That might become more important in the future, since Firefox is likely to support inline MathML in text/html not too long from now. This sort of thing would be harder if it were Python rather than PHP.
I don't think it would be a big deal if it were rewritten entirely in Python, though. It would be a big step forward in any case, and if it's easier for you, great. So personally I'd be okay with it, although it's perhaps not ideal.
Also, would anyone be interested in mentoring this project?
I probably wouldn't be of any help for this particular project, since I don't know anything about parsers, and my Python and TeX are passable but not great. We could probably come up with a mentor, though.
On 28/03/10 18:59, Aryeh Gregor wrote:
On Fri, Mar 26, 2010 at 10:48 PM, Damon Wangdamonwang@uchicago.edu wrote:
(You also want it repackaged as a Mediawiki extension rather than a core feature; I'm going to do that, but I won't say anything more because it seems fairly uncontroversial.)
I actually disagree with this pretty strongly. It would be a regression in functionality for existing users -- if they upgrade, their wiki breaks unless they install a new extension. There's no reason to remove it from core that I see that outweighs this disadvantage.
Since the subset of TeX you need parsed has a context-free grammar, it needs an LALR parser, not just a bunch of regexes. I know three ways to get an LALR parser:
(1) write a pushdown automaton manually (i.e., be yacc)
(2) write input for a parser-generator
(3) write a parser-generator, and give it input
Option (2) is the most maintainable and feasible option, and it's precisely the one that cannot be done in PHP. As far as I know, PHP has no parser-generator package. (Please, please let me know if that's incorrect so I can stop embarrassing myself and get on with writing a GSoC proposal.)
I could probably do (1), or some hackish kludge at half of it, by throwing custom control structures into a bucketload of regexes, but I don't think that's in the project's best interests. As has been pointed out, the OCaml implementation is really concise and elegant. A large fraction of that concision and elegance comes from not actually being a parser but rather only a context-free grammar written in a BNF-like syntax common to most parser-generators.
Okay, well, maybe you're right. I'd be interested to hear Tim Starling's opinion on this (using parser generators vs. writing by hand). Writing it in Python would certainly be a big step forward from OCaml -- any site with LaTeX accessible to MediaWiki will almost certainly have Python available, so Python vs. PHP should make no difference to end-users. And Python is probably the second-best-known language among MediaWiki hackers.
Have you had a look at pyparsing, which is a ready-made all-singing-all-dancing Python parser package with a large amount of syntactic sugar built in to allow the more-or-less direct input of grammar notations?
Given that the texvc source already has a grammar encoded into it in machine-executable form, it might be worth considering mechanically extracting that grammar from the texvc OCaml source, and then reformatting it into a grammar in pyparsing's natural format.
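Assuming the grammar lives in an ocamlyacc-style file (I haven't checked texvc's actual layout), a first pass at that extraction could be a small script like this sketch. The sample rules below are invented stand-ins, not texvc's real grammar:

```python
import re

# Hypothetical ocamlyacc-style fragment standing in for texvc's real
# grammar file; in practice this would be read from disk.
MLY_SAMPLE = """
expr:
    | TOKEN expr { ... }
    | CURLY_OPEN expr CURLY_CLOSE { ... }
;
fraction:
    | FRAC expr expr { ... }
;
"""

# A rule is "name:" followed by alternatives, terminated by ";".
RULE = re.compile(r"^(\w+):(.*?);", re.S | re.M)

def extract_rules(src):
    """Map each nonterminal to its list of alternatives, with the
    OCaml semantic actions ({ ... }) stripped out. The result is a
    starting point for a pyparsing translation, not a finished one."""
    rules = {}
    for name, body in RULE.findall(src):
        alts = [re.sub(r"\{.*?\}", "", alt).split()
                for alt in body.split("|") if alt.strip()]
        rules[name] = alts
    return rules
```

The semantic actions would still have to be rewritten by hand, but the shape of the grammar would survive the port mechanically.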
-- Neil
"Aryeh Gregor" Simetrical+wikilist@gmail.com wrote in message news:7c2a12e21003281059i551c4650p8a8e51e100b62479@mail.gmail.com...
On Fri, Mar 26, 2010 at 10:48 PM, Damon Wang damonwang@uchicago.edu wrote:
(You also want it repackaged as a Mediawiki extension rather than a core feature; I'm going to do that, but I won't say anything more because it seems fairly uncontroversial.)
I actually disagree with this pretty strongly. It would be a regression in functionality for existing users -- if they upgrade, their wiki breaks unless they install a new extension. There's no reason to remove it from core that I see that outweighs this disadvantage.
As opposed to their wiki breaking when they upgrade for all the other reasons that we document in the release notes? I have never built a wiki where texvc has been needed, wanted, or even thought harmless. Currently MW users have to compile and configure a binary from a language 99.99% of them cannot understand, and enable the functionality using config variables. Asking them instead to download and install an extension like every other non-ubiquitous feature in MediaWiki is far from being a regression.
--HM
On Sun, Mar 28, 2010 at 7:10 PM, Happy-melon happy-melon@live.com wrote:
As opposed to their wiki breaking when they upgrade for all the other reasons that we document in the release notes?
If you make sure to run update.php, it's very rare for your wiki to break, unless you've hacked things or not updated your extensions or such. We're usually pretty careful to avoid significant regressions when upgrading wikis that are using supported/sane configurations.
I have never built a wiki where texvc has been needed, wanted, or even thought harmless.
Granted, this is not as widely used as some other optional features. There are certainly many wikis that do use it, though -- it's not like no one will be affected.
Currently MW users have to compile and configure a binary from a language 99.99% of them cannot understand, and enable the functionality using config variables. Asking them instead to download and install an extension like every other non-ubiquitous feature in MediaWiki is far from being a regression.
It's a regression for people who already have math working. What's the advantage? We have an awful lot of marginal features in core. When have we ever split a feature that we'd released in core into an extension, when a significant number of people were using it? I don't see the point. It's not like we're going to significantly reduce the size of the tarball or anything.
If you make sure to run update.php, it's very rare for your wiki to break, unless you've hacked things or not updated your extensions or such. We're usually pretty careful to avoid significant regressions when upgrading wikis that are using supported/sane configurations.
Can we make update.php ask the user if he wants to install the new extension?
It's a regression for people who already have math working. What's the advantage? We have an awful lot of marginal features in core. When have we ever split a feature that we'd released in core into an extension, when a significant number of people were using it? I don't see the point. It's not like we're going to significantly reduce the size of the tarball or anything.
Is there any place we could get usage statistics for the math feature? I think the advantages for new installations justify inconveniencing some existing users, especially if we can automate installation of the new extension, but this discussion would be better with some data.
Yours, Damon Wang
On Sun, Mar 28, 2010 at 11:45 PM, Damon Wang damonwang@uchicago.edu wrote:
Can we make update.php ask the user if he wants to install the new extension?
That would be hacky and unreliable. We'd have to make sure the versions match, automatically alter LocalSettings.php (!), hope that the wiki files are writable by the web server (they probably aren't), hope that it's not on a firewalled intranet, and so on. Also, update.php doesn't require user interaction, and changing that would break everything.
Is there any place we could get usage statistics for the math feature?
No, we don't have this kind of tracking in place. People would probably object if we did.
I think the advantages for new installations justify inconveniencing some existing users, especially if we can automate installation of the new extension, but this discussion would be better with some data.
I'm not so much worried about math specifically as about what would happen if we started systematically moving relatively-unused things from core to extensions. Few people use math, but a whole lot of people probably use at least one little-used feature that could be moved to an extension. We generally haven't moved things from core to extensions AFAIK -- if we started doing it, it could cumulatively have repercussions on ease of upgrade, for no real benefit that I see.
On Mon, Mar 29, 2010 at 12:12 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
What if it was written as an extension and moved to /extensions? Then we get the benefit of decoupling Math from the core software, but we don't require users to download a new extension to keep their existing functionality. As long as it's clearly indicated in the RELEASE-NOTES and relevant Manual pages that you might need to update the path to Math, I don't see a huge drawback.
-Chad
On Mon, Mar 29, 2010 at 12:46 PM, Chad innocentkiller@gmail.com wrote:
What if it was written as an extension and moved to /extensions? Then we get the benefit of decoupling Math from the core software,
What benefit is this? It's not realistically decoupled from the core software unless it avoids using MediaWiki functions and classes. We have some code in core that deliberately does this, like IEContentAnalyzer.php and all of includes/normal/. Making something an extension per se doesn't change how tightly it's coupled with MediaWiki -- it's an orthogonal issue.
The only reasons I see to have extensions in trunk at all, instead of having everything in core, are that 1) it helps keep us honest about ensuring there are extension points for third parties to use, and 2) otherwise the tarball would be huge. Neither point argues for breaking things already in core out into extensions.
Aryeh Gregor wrote:
<math> was implemented directly in the parser a really long time ago. That's why it has always been in core, despite not being available to 99% of users. Only recently did I free 'math' from the parser (r57997) so that another tag hook can provide the same functionality (I had given up on math and wanted to use [[Extension:Mimetex_alternative]]), and then Tim moved it out, creating CoreTagHooks (r61913). Logically it is an extension, much more so than e.g. ParserFunctions, and I bet there are more users of ParserFunctions than of math. Changing to Python will also break things for people who compiled math, update without reading the release notes, and don't have Python. We might as well move it to a separate extension instead of embedding it, although that wouldn't be much of an issue. The only way to ensure it keeps working for everyone would be to provide a PHP implementation, in which case thousands of installs would suddenly get a working toolbar button.
On Tue, Mar 30, 2010 at 2:13 PM, Platonides Platonides@gmail.com wrote:
Changing to python will also break for people that compiled math, update without reading the release notes and don't have python.
While this is of course possible, how big is the chance that somebody will have OCaml but not Python?
* Bryan Tong Minh bryan.tongminh@gmail.com [Tue, 30 Mar 2010 17:22:09 +0200]:
While this is of course possible, how big is the chance that somebody will have OCaml but not Python?
Fedora Linux has had OCaml for ages: yum install ocaml, or something like that. Compiling texvc is fast and easy; I've never had any problems. Since OCaml was developed in France, chances are it's more widespread over there. Dmitriy
Bryan Tong Minh wrote:
On Tue, Mar 30, 2010 at 2:13 PM, Platonides wrote:
Changing to python will also break for people that compiled math, update without reading the release notes and don't have python.
While this is of course possible, how big is the chance that somebody will have OCaml but not Python?
My point is, we shouldn't strive so much for backwards compatibility. It's possible, but extremely unlikely.
Dmitry wrote:
Fedora Linux has had OCaml for ages: yum install ocaml, or something like that. Compiling texvc is fast and easy; I've never had any problems. Since OCaml was developed in France, chances are it's more widespread over there. Dmitriy
Note that people installing MediaWiki from packages will be using something like a mediawiki-math package, and the upgrade would be transparent for them.
On 29 March 2010 01:12, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
I have never built a wiki where texvc has been needed, wanted, or even thought harmless.
Granted, this is not as widely used as some other optional features. There are certainly many wikis that do use it, though -- it's not like no one will be affected.
I'd *like* to use it, but it's such an arse I've never yet got it working ... perhaps it's just me.
- d.
Damon Wang wrote:
Option (2) is the most maintainable and feasible option, and it's precisely the one that cannot be done in PHP. As far as I know, PHP has no parser-generator package. (Please, please let me know if that's incorrect so I can stop embarrassing myself and get on with writing a GSoC proposal.)
A quick search shows http://pear.php.net/package/PHP_ParserGenerator and http://code.google.com/p/antlrphpruntime/ Maybe they are useless, but it's worth evaluating them. I suppose you could keep the choice of language open in the GSoC proposal.
On 23 March 2010 19:17, Rob Lanphier robla@robla.net wrote:
As I'm sure you've already gathered from the other responses, this is exactly the right place. I'm a little skeptical myself that porting that particular piece of code from OCaml to Python is going to be a really big win for us (because it's still a "foreign" language as far as PHP-based MediaWiki is concerned, so integration is still a little clunky and performance may take a hit due to yet another interpreter needing to load), but I'll let others weigh in on whether I'm making too big a deal about that.
Getting it off Ocaml is an excellent first step. I have tried and failed to get texvc working properly in MediaWiki myself more than a few times, because of Ocaml not wanting to play nice ...
- d.
On Tue, 30 Mar 2010 16:05:02 +0300, David Gerard dgerard@gmail.com wrote:
Getting it off Ocaml is an excellent first step. I have tried and failed to get texvc working properly in MediaWiki myself more than a few times, because of Ocaml not wanting to play nice ...
Actually, I completely disagree. Since I have some experience with both OCaml and PHP, converting the math processing to PHP does not look like a good idea to me at all.
Probably the issues you had were more a matter of a wrong or problematic configuration. OCaml itself is actively developed and is a mature language and development environment, much better than PHP or Python (IMHO).
It will be interesting to wait a bit and compare the PHP and OCaml implementations of texvc (if there turns out to be anything to compare at all).
Wish you good luck, Victor
On Tue, Mar 30, 2010 at 14:34, Victor bobbie@ua.fm wrote:
Actually, I completely disagree. Since I have some experience with both OCaml and PHP, converting the math processing to PHP does not look like a good idea to me at all.
Probably the issues you had were more a matter of a wrong or problematic configuration. OCaml itself is actively developed and is a mature language and development environment, much better than PHP or Python (IMHO).
It will be interesting to wait a bit and compare the PHP and OCaml implementations of texvc (if there turns out to be anything to compare at all).
c is "better", so is Common Lisp, Scheme, Haskell, Clojure or a number of other languages.
The problem is that worse is better. OCaml isn't widely known among programmers, nor as easy for PHP programmers to get into as, say, Perl, Python or Ruby. As a result, the math/ directory has been untouched (aside from the stray doc or bug fix) since 2003.
There are many long-standing core issues with the texvc component in Bugzilla (http://bit.ly/bsSUPM) that no one is looking at.
I don't think anyone would have a problem with it remaining in OCaml if it was being maintained and these long-standing bugs were being fixed.
On Wed, Mar 31, 2010 at 10:58, Ævar Arnfjörð Bjarmason avarab@gmail.com wrote:
c is "better"
That should have been "OCaml".
Ævar Arnfjörð Bjarmason wrote:
On Wed, Mar 31, 2010 at 10:58, Ævar Arnfjörð Bjarmason avarab@gmail.com wrote:
c is "better"
That should have been "OCaml".
No it shouldn't! :)
On Wed, 31 Mar 2010 13:58:03 +0300, Ævar Arnfjörð Bjarmason avarab@gmail.com wrote:
Hi, now I see...
I've posted a message to the fa.caml newsgroup: http://groups.google.com/group/fa.caml/browse_frm/thread/1593e053759d7679
hopefully somebody will volunteer to fix the issues, freeing up human resources for better tasks.
With best regards, Victor
On 03/31/2010 12:31 PM, Victor wrote:
While I'll no doubt regret saying this, I am happy to fix some of these bugs. With the majority of these it's harder to decide "should we fix" and "how should we fix" rather than actually being hard to implement.
What I don't want to do is fix things to find that it then gets immediately re-implemented in PHP, which seems to be what people want.
Some of the issues have LaTeX dependencies, particularly "support Unicode", so fixing them could be dangerous unless we conditionally include support for them, e.g. by running a feature test as part of the installation (we should probably do that anyway, so that we can provide nicer error messages to users who are missing one of the other dependencies).
Conrad
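The install-time feature test suggested in the previous message could start as something like the sketch below. The tool names are the usual external dependencies of a texvc-style setup, but treat the exact list (and the `convert` fallback) as assumptions to be adjusted, not as what the installer actually checks.

```python
# Sketch of an install-time dependency probe: report which external
# tools are available so the installer can disable optional features
# (e.g. Unicode support) or print a clear error, instead of failing
# mysteriously at render time. Tool names are illustrative.
import shutil

REQUIRED = ["latex", "dvipng"]
OPTIONAL = ["convert"]  # hypothetical ImageMagick fallback

def probe_dependencies():
    """Return (report, missing): paths for each tool, and required tools not found."""
    report = {}
    for tool in REQUIRED + OPTIONAL:
        # shutil.which returns the full path, or None if not on PATH
        report[tool] = shutil.which(tool)
    missing = [t for t in REQUIRED if report[t] is None]
    return report, missing
```

An installer would call `probe_dependencies()` once and refuse to enable math rendering while `missing` is non-empty.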
Conrad Irwin wrote:
While I'll no doubt regret saying this, I am happy to fix some of these bugs. With the majority of these it's harder to decide "should we fix" and "how should we fix" rather than actually being hard to implement.
What I don't want to do is fix things to find that it then gets immediately re-implemented in PHP, which seems to be what people want.
I don't think "reimplement texvc in PHP" is anyone's goal as such. The real goal is "make texvc actively maintained". Reimplementing it in PHP would be one way to achieve that, since we have plenty of PHP programmers who could then maintain it. Finding some person or people who know OCaml and are willing to do the work would be another route to the same end.
In general, there's a form of decision paralysis common to volunteer projects, particularly ones relying on skilled volunteers. Basically, there are two solutions to a problem, X and Y:
Person A says "I can try to do X, but I don't want to spend the time if we're just going to do Y instead."
Person B says "I can try to do Y, but I don't want to spend the time if we're just going to do X instead."
No-one else wants to commit to either X or Y, since they want to keep the other option open in case A or B doesn't succeed with their favored approach after all.
End result is that neither X nor Y actually gets done.
There are two ways out of this situation: either the project needs to commit to one option and make sure it gets done, or A and B need to accept the risk that they might end up doing redundant work. Ironically, the meta-decision on whether to commit to one approach or try both can also get stuck in a similar dilemma on a higher level.
Going back to the concrete issue here, I'd personally recommend trying both _for now_. In particular, rewriting texvc in Python or PHP, as a MediaWiki extension, would seem like a good GSoC project even if it didn't actually end up being adopted into MW core in the end.
Meanwhile, fixing at least the simplest and most critical issues in the current OCaml implementation would also be of immediate value, even if that implementation might possibly end up being replaced at some point off in the future. I wouldn't necessarily recommend going immediately for the more tricky issues, or the ones with lower short-term benefit per effort, but I'm sure there must be some low-hanging fruit ready to be picked by anyone who's simply familiar with the language.
By autumn, we ought to have some idea how much, if any, progress has been made with each approach. At that point, we should be better able to decide whether to commit to one approach or the other, and if so, which.
I should also note that, as long as one implementation doesn't _completely_ supersede the other in every way, there would probably be people interested in using each of them if they were available as optional extensions. In particular, I'm sure there are people who have access to Python but haven't managed to set up OCaml -- and I wouldn't be completely surprised if the opposite turned out to be also true.
Of course, that's just my opinion as a random occasional contributor. Take it with as much salt as you think appropriate.
On 30 March 2010 16:34, Victor bobbie@ua.fm wrote: ....
Getting it off Ocaml is an excellent first step. I have tried and failed to get texvc working properly in MediaWiki myself more than a few times, because of Ocaml not wanting to play nice ...
Actually, I completely disagree. Since I have some experience with both OCaml and PHP, converting the math processing to PHP does not look like a good idea to me at all.
Doing Math in any programming language or digital computer is a bad idea. Anyway.
It could be worse. It could be Math in JavaScript:
v = (011 + "1" + 0.1)/3;
303.3666666666667
On Wed, Mar 31, 2010 at 14:24, Tei oscar.vives@gmail.com wrote:
Doing Math in any programming language or digital computer is a bad idea. Anyway.
The texvc component doesn't "do math". It just sanitizes LaTeX and passes it off to have a PNG generated from it.
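As a purely illustrative sketch of that "sanitize, then render" pipeline in Python (the language proposed for the port): validate the input against a command whitelist, then shell out to latex and dvipng. The whitelist and helper names here are invented for the example; texvc's real command tables are far larger.

```python
# Illustrative texvc-style pipeline: whitelist-validate the TeX input,
# then hand it to external tools to produce a PNG. Not texvc's actual
# tables or file layout.
import re
import subprocess

ALLOWED_COMMANDS = {"frac", "sqrt", "alpha", "beta", "sum", "int"}  # tiny example subset

def validate(tex: str) -> bool:
    """Accept input only if every \\command it uses is on the whitelist."""
    return all(cmd in ALLOWED_COMMANDS
               for cmd in re.findall(r"\\([A-Za-z]+)", tex))

def render_png(tex: str, out_png: str) -> None:
    """Wrap validated TeX in a minimal document and render it (assumes
    latex and dvipng are installed)."""
    if not validate(tex):
        raise ValueError("disallowed LaTeX command")
    doc = "\\documentclass{article}\\begin{document}$%s$\\end{document}" % tex
    with open("eq.tex", "w") as f:
        f.write(doc)
    subprocess.run(["latex", "-halt-on-error", "eq.tex"], check=True)
    subprocess.run(["dvipng", "eq.dvi", "-o", out_png], check=True)
```

The security-relevant part is `validate`: anything not explicitly whitelisted is rejected before any external tool ever sees it.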
It could be worse. It could be Math in JavaScript: v = (011 + "1" + 0.1)/3; 303.3666666666667
Somebody ought to slap you for mixing four different types in such a horrendous manner to construct something that is supposed to make people who do not understand octal numbers and string concatenation think JavaScript is insane. *shakes head*
On 03/31/2010 05:32 PM, Daniel Schwen wrote:
It could be worse. It could be Math in JavaScript: v = (011 + "1" + 0.1)/3; 303.3666666666667
Somebody ought to slap you for mixing four different types in such a horrendous manner to construct something that is supposed to make people who do not understand octal numbers and string concatenation think JavaScript is insane. *shakes head*
The point should stand even if there's faulty reasoning:
([8, 32, 189].sort()[0]) === (new Boolean(false) ? 189 : 8)
Conrad
([8, 32, 189].sort()[0]) === (new Boolean(false) ? 189 : 8)
Why such a contrived example?! This all boils down to
new Boolean(false) == false returning true
new Boolean(false) === false returning false
The ?: operator is just not performing an implicit cast. The fact that the Boolean object is different from the boolean primitive type is... well, unfortunate. That's why its use as an object has been deprecated since JavaScript 1.3. Your example gives the expected outcome if you call Boolean as a function: ([8, 32, 189].sort()[0]) === (Boolean(false) ? 189 : 8)
So, can we drop this dreaded debate now?
2010/3/31 Daniel Schwen lists@schwen.de:
([8, 32, 189].sort()[0]) === (new Boolean(false) ? 189 : 8)
Why such a contrived example?! This all boils down to
new Boolean(false) == false returning true
new Boolean(false) === false returning false
Not quite. There's also the issue of [8, 32, 189].sort() returning [189, 32, 8]. The real fun is that a "sane" interpretation of both operands yields 8 === 8, which is true, whereas the expression really evaluates to true via 189 === 189.
But I digress. We should indeed be talking about GSoC in this thread.
Roan Kattouw (Catrope)
On 30 March 2010 15:34, Victor bobbie@ua.fm wrote:
On Tue, 30 Mar 2010 16:05:02 +0300, David Gerard dgerard@gmail.com wrote:
Getting it off Ocaml is an excellent first step. I have tried and failed to get texvc working properly in MediaWiki myself more than a few times, because of Ocaml not wanting to play nice ...
Actually, I completely disagree. Since I have some experience with both OCaml and PHP, converting the math processing to PHP does not look like a good idea to me at all. Probably the issues you had were more a matter of a wrong or problematic configuration. OCaml itself is actively developed and is a mature language and development environment, much better than PHP or Python (IMHO).
Oh, I don't doubt that at all. It just didn't work for me :-)
- d.
Python is a nice language. PHP (portability) or C/C++ (speed) would be better, but Python is preferable to OCaml.
You mention ANTLR; something like that could be a good choice, because it should allow generating the same parser in a different language without much effort (you probably won't have enough time in GSoC for that, but a design taking that option into account would be interesting).
So you could do something like this (please don't take it as a list of requirements):
* Figure out what on earth the current texvc is doing.
* Document it heavily.
* Design how to create the next texvc.
* Write any parser you need for it.
* Do the actual implementation.
You seem to be thinking about creating a PHP extension. I don't think you should go that route. A binary is good enough; we don't need it to be a PHP extension. That glue could be added later if needed, but it would increase the complexity of writing and debugging.
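To give a feel for the parser step in the list above, here is a hand-rolled, purely illustrative tokenizer-and-checker in Python. A real implementation would use a generated parser (PLY, ANTLR, or similar) and texvc's full grammar; the token pattern and the whitelist are made up for the sketch.

```python
# Illustrative shape of the validator: split TeX source into commands,
# braces, and runs of plain text, then reject unknown commands and
# unbalanced braces. Real texvc enforces a far richer grammar.
import re

# \command | single brace | run of anything else
TOKEN_RE = re.compile(r"\\[A-Za-z]+|[{}]|[^\\{}]+")

def tokenize(tex):
    return TOKEN_RE.findall(tex)

def check(tex, allowed):
    """Return True iff every command is whitelisted and braces balance."""
    depth = 0
    for tok in tokenize(tex):
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
            if depth < 0:
                return False          # closing brace with no opener
        elif tok.startswith("\\") and tok[1:] not in allowed:
            return False              # command not on the whitelist
    return depth == 0                 # all opened braces were closed
```

Even this toy version shows why a parser generator pays off: once nesting rules, argument counts, and environments enter the picture, hand-rolled state tracking becomes unmanageable.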
"Platonides" Platonides@gmail.com wrote in message news:hobfpi$4ud$1@dough.gmane.org...
You seem to be thinking about creating a PHP extension. I don't think you should go that route. A binary is good enough; we don't need it to be a PHP extension. That glue could be added later if needed, but it would increase the complexity of writing and debugging.
I took it to mean that he wanted to split the math parsing out as a **MediaWiki** extension, implementing <math> as a parser tag hook in the usual way. Which is definitely highly desirable.
--HM
Happy-melon wrote:
I took it to mean that he wanted to split the math parsing out as a **MediaWiki** extension, implementing <math> as a parser tag hook in the usual way. Which is definitely highly desirable.
--HM
Making it a MediaWiki extension is of course desirable (moving texvc out of core is a pending issue, at least now <math> can be used by extensions).
but Damon wrote:
Another possibility would be writing it in C to avoid all interpreter overhead, and using a foreign function interface. Unfortunately, I'm not familiar with PHP's FFI. Google takes me to http://wiki.php.net/rfc/php_native_interface which seems to think that as of a year ago there weren't any good ones, but this doesn't look too painful: http://theserverpages.com/php/manual/en/zend.creating.php
That's about PHP extensions (which are written in C). So instead of going down that path, he should write a C program that does what texvc does. It could be moved into a PHP extension later if really needed, but starting with Zend extensions would be unneeded pain for this project.
On Tue, Mar 23, 2010 at 9:06 AM, Damon Wang damonwang@uchicago.edu wrote:
Hello everyone,
I'm interested in porting texvc to Python, and I was hoping this list here might help me hash out the plan. Please let me know if I should take my questions elsewhere.
If I understand correctly, you want to write a <insert language here> script that validates LaTeX and calls the LaTeX compiler. Why can't the validator be written as an integral extension of MediaWiki itself, with MediaWiki calling the LaTeX compiler? Is there a particular reason to have the validation done by an external program?
Bryan