On Tue, Jun 30, 2009 at 12:56 PM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
On Tue, Jun 30, 2009 at 12:16 PM, Brion Vibber <brion@wikimedia.org> wrote:
- PHP
Advantage: Lots of webbish people have some experience with PHP or can easily find references.
Advantage: we're pretty much guaranteed to have a PHP interpreter available. :)
Disadvantage: PHP is difficult to lock down for secure execution.
I think it would be easy to provide a very simple locked-down version, with most of the features gone. You could, for instance, only permit variable assignment, use of built-in operators, a small whitelist of functions, and conditionals. You could omit loops, function definitions, and abusable functions like str_repeat() (let alone exec(), eval(), etc.) from a first pass. This would still be vastly more powerful, more readable, and faster than ParserFunctions.
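To make the whitelist idea concrete, here is a minimal sketch of such a locked-down evaluator. It is written in Python rather than PHP purely for illustration, and everything in it (the `run` and `evaluate` names, the contents of `ALLOWED_FUNCS`, the choice of operators) is an assumption, not an existing MediaWiki interface. It accepts only variable assignment, a few operators, whitelisted function calls, and conditionals; anything else is rejected.

```python
import ast
import operator

# Hypothetical whitelist of callable functions; the names are illustrative only.
ALLOWED_FUNCS = {"len": len, "upper": str.upper, "lower": str.lower}

# Operators the sandbox permits. Multiplication is left out on purpose,
# because "ab" * 10**9 is just str_repeat() in disguise.
BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub}
CMP_OPS = {ast.Eq: operator.eq, ast.NotEq: operator.ne,
           ast.Lt: operator.lt, ast.Gt: operator.gt}


class Disallowed(Exception):
    """Raised for any construct outside the whitelist."""


def evaluate(node, env):
    """Evaluate one expression node against the variable table `env`."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, str)):
        return node.value
    if isinstance(node, ast.Name):
        if node.id not in env:
            raise Disallowed(f"unknown variable: {node.id}")
        return env[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
        return BIN_OPS[type(node.op)](evaluate(node.left, env),
                                      evaluate(node.right, env))
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in CMP_OPS):
        return CMP_OPS[type(node.ops[0])](evaluate(node.left, env),
                                          evaluate(node.comparators[0], env))
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in ALLOWED_FUNCS and not node.keywords):
        args = [evaluate(arg, env) for arg in node.args]
        return ALLOWED_FUNCS[node.func.id](*args)
    raise Disallowed(f"construct not allowed: {type(node).__name__}")


def run(source, env=None):
    """Run a script made only of assignments and if/else; return its variables."""
    env = dict(env or {})

    def run_block(statements):
        for stmt in statements:
            if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                    and isinstance(stmt.targets[0], ast.Name)):
                env[stmt.targets[0].id] = evaluate(stmt.value, env)
            elif isinstance(stmt, ast.If):
                run_block(stmt.body if evaluate(stmt.test, env) else stmt.orelse)
            else:
                # Loops, function definitions, imports, etc. all end up here.
                raise Disallowed(f"statement not allowed: {type(stmt).__name__}")

    run_block(ast.parse(source).body)
    return env
```

Even with loops gone, repeated self-concatenation (`x = x + x` on successive lines) still doubles a string each time, so a whitelist like this reduces the attack surface but does not remove the need for memory and time limits.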
Hopefully, we could make this secure enough for your average shared-host website to run it by default with no special measures taken and without much risk. Installations with more access and higher security requirements, like Wikimedia, could shell out to a process that's sandboxed on the OS level to be on the safe side. I'd like to hear what Tim thinks about the possibility of securing PHP like this.
Of course, PHP is evil, and supporting it sucks. :( But if we *really* *really* need to support users who can't shell out to other programs, I think it's the only real language that's a feasible solution.
I'd encourage you to consider requiring exec() support for full use of Wikipedia templates, though. Many really big shared hosts allow it, like 1and1.com. Anyone big enough to include much Wikipedia content will likely be on at least a VPS anyway. And if your host doesn't support exec(), then at *worst* you can still get the articles in a totally usable form -- just run Special:ExpandTemplates on all the article's templates. You can then transclude those on a per-article basis; we could update Special:Export to make this easier. The only problem in this case would be that you can't easily change the formatting of all the templates at once -- but such a small site would likely have few enough articles to do it by hand, if they even want to.
I think saying that users without exec() support get to use Wikipedia content in a somewhat less usable form would be just fine, and it would *really* open up our options. We could support basically any programming language in that case.
- Python
Advantage: A Python interpreter will be present on most web servers, though not necessarily all. (Windows-based servers especially.)
Wash: Python is probably better known than Lua, but not as well as PHP or JS.
Disadvantage: Like PHP, Python is difficult to lock down securely.
It doesn't matter whether it's present, does it? If the user has exec() support, they could download a binary interpreter for *any* language to their webspace and run it from there regardless of whether the language is supported on the host. So Python is on exactly the same level as Lua here.
Much though I love Python, Lua looks like the better option. First of all, it's *very* small. sudo apt-get install lua50 on my machine uses up only 180 KB of disk space, and the package is 30 KB gzipped. Our current tarballs are 10 MB; we could easily just chuck in Lua binaries for Linux x86-32 and Windows without even noticing the size increase, and allow users to enable it with one line in LocalSettings.php. By contrast, python2.6 is around 10 MB uncompressed, 2.5 MB compressed. Perl is twice that size. Windows users, or users with exec() allowed but open_basedir preventing access to /usr/bin, would have to obtain Python/Perl/etc. themselves.
It looks to me like Lua would be a lot easier to sandbox. It seems pretty simple to deny all I/O within the language itself, so you'd (hopefully) just need memory and CPU limits. Both of those could be implemented on Linux with hard setrlimit() values plus nice. Similar things exist on Windows, hopefully accessible by command line somehow. If we're shipping binaries with MediaWiki, we could even hack the code if necessary, to use whatever sandboxing mechanisms the OS makes available, although hopefully that would be unneeded.
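The setrlimit()-plus-nice approach can be sketched roughly as follows, assuming a POSIX host. The function name `run_limited` and the default limits are assumptions made for illustration; the interpreter being launched could be a bundled Lua binary, but any command line works the same way.

```python
import os
import resource
import subprocess
import sys


def _clamp(limit, kind):
    """Never request more than the hard limit this process already has."""
    _, hard = resource.getrlimit(kind)
    return limit if hard == resource.RLIM_INFINITY else min(limit, hard)


def run_limited(argv, cpu_seconds=1, memory_bytes=512 * 1024 * 1024, timeout=10):
    """Run argv with hard CPU and address-space limits at the lowest priority."""

    def apply_limits():
        # Runs in the child between fork() and exec(); POSIX only.
        cpu = _clamp(cpu_seconds, resource.RLIMIT_CPU)
        mem = _clamp(memory_bytes, resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_CPU, (cpu, cpu))
        resource.setrlimit(resource.RLIMIT_AS, (mem, mem))
        os.nice(19)  # raise niceness as far as it goes

    # The wall-clock timeout is a backstop for children that sleep
    # instead of burning CPU time.
    return subprocess.run(argv, preexec_fn=apply_limits,
                          capture_output=True, timeout=timeout)
```

A child that exceeds its CPU allowance is killed by the kernel, so the caller sees a non-zero return code rather than a hung request. For example, `run_limited([sys.executable, "-c", "while True: pass"])` returns after about one second of CPU time.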
I don't think we should fixate too much on how many people know the language. It's not hard to pick up a new language if you already know one, and Lua has the reputation of being simple (although I haven't tried to learn it). I think Lua is the best option here.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
In addition to resource limits, any scheme had better make sure that what's passed into the programming language and what's passed out makes sense. For example, you shouldn't have it generating raw HTML, and you probably shouldn't let it mess with strip markers. Some of this may be automatic depending on how it's integrated into the parser. One would probably also want to limit the size of the allowed output (e.g. don't let it send 5 MB to the user). Depending on the integration, there may be other control sequences that need to be caught when it returns as well.
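As a rough illustration of those output checks, the sketch below caps the output size, rejects control characters, and escapes HTML metacharacters. The names `sanitize_output` and `MAX_OUTPUT_BYTES`, the 64 KB cap, and the assumption that parser-internal markers can be recognized by the control characters they contain are all mine; a real integration would follow whatever the parser actually uses, and would likely need a finer rule than blanket escaping.

```python
import html

MAX_OUTPUT_BYTES = 64 * 1024  # hypothetical cap, far below the 5 MB example


class OutputRejected(Exception):
    """Raised when sandbox output fails a sanity check."""


def _is_forbidden_control(ch):
    # Assumption: strip markers and similar control sequences contain
    # control characters that never appear in legitimate template output.
    return (ord(ch) < 32 and ch not in "\n\t") or ch == "\x7f"


def sanitize_output(text, max_bytes=MAX_OUTPUT_BYTES):
    """Check size and control characters, then neutralize raw HTML."""
    if len(text.encode("utf-8")) > max_bytes:
        raise OutputRejected("output too large")
    if any(_is_forbidden_control(ch) for ch in text):
        raise OutputRejected("control character in output")
    # Treat the result as text, not markup: <, > and & are escaped.
    return html.escape(text, quote=False)
```

With this, `sanitize_output("<b>hi</b>")` returns `&lt;b&gt;hi&lt;/b&gt;`, while oversized output or output containing a marker-like control character raises `OutputRejected`.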
On a separate point, one of the limitations of stand-alone sandboxes is that they would make it harder for the code to call other template pages. One of the few virtues of the current template code is that it is relatively modular, with more complex templates being built out of less complex ones. If this programming language is meant to replace that, then it would also need to be able to reference the results of other template pages. One solution is to pre-expand those sections (similar to what is done now, I believe), but that can get rather delicate once one has programming constructs like variable assignments, looping, and recursion, since the template parameters won't necessarily be fixed at the Preprocessor stage.
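One alternative to pre-expansion is to hand the script a callback that expands other templates on demand, with a depth cap to stop runaway recursion. The sketch below is only a toy model under that assumption: templates are plain Python callables, and `make_expander` and `MAX_DEPTH` are hypothetical names, not part of any existing parser.

```python
MAX_DEPTH = 40  # hypothetical cap on nested template calls


def make_expander(templates, max_depth=MAX_DEPTH):
    """Build an expand(name, args) function over a dict of template callables.

    Each template is called as template(args, call), where call(name, args)
    expands another template one level deeper.
    """

    def expand(name, args, depth=0):
        if depth > max_depth:
            raise RecursionError(f"template depth limit exceeded in {name}")
        if name not in templates:
            raise KeyError(f"no such template: {name}")
        return templates[name](args, lambda n, a: expand(n, a, depth + 1))

    return expand


templates = {
    "bold": lambda args, call: "'''" + args["text"] + "'''",
    # The argument passed to "bold" is computed at run time, so it could not
    # have been pre-expanded before the script started.
    "greet": lambda args, call: "Hello, " + call("bold", {"text": args["name"].upper()}),
    # A template that calls itself forever is stopped by the depth cap.
    "loop": lambda args, call: call("loop", args),
}

expand = make_expander(templates)
print(expand("greet", {"name": "world"}))  # prints Hello, '''WORLD'''
```

The cost of this design is that the sandbox is no longer fully stand-alone: every call crosses back into the parser, which then has to apply the same limits to the nested expansion.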
-Robert Rohde