On Tue, Jun 30, 2009 at 11:46 PM, Tim Starling <tstarling@wikimedia.org> wrote:
[snip]
> SpiderMonkey and Python both lack control over memory usage. Python lacks a sandbox mode, the rexec module has been removed. SpiderMonkey isn't embedded in any useful kind of standalone, so you'd have to start with a C development project, like you would for Lua.
CPython has about a billion ways to inject machine code; this is one reason why Rpython failed. If you were to do Python, it would probably need to be embedded in Java.
For SpiderMonkey, the model I would have envisioned is a separate script executor daemon which spawns a thread per script (with limits to keep the peak thread count reasonable) and arbitrates communication with MediaWiki over sockets. Memory limits then become a simple exercise in providing an instrumented malloc and setting the thread stack size appropriately.
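To make the thread-per-script idea concrete, here is a rough sketch in Python (chosen only for brevity; the real thing would be a C daemon around SpiderMonkey). The class name `ScriptExecutor`, the `max_threads` parameter, and the result dictionary are all made up for illustration, and the socket layer and the instrumented malloc are left out entirely.

```python
import threading


class ScriptExecutor:
    """Hypothetical thread-per-script executor with a cap on live threads."""

    def __init__(self, max_threads=4):
        # One semaphore slot per script thread that may exist at once.
        self._slots = threading.BoundedSemaphore(max_threads)
        self._lock = threading.Lock()
        self.live = 0  # script threads currently running
        self.peak = 0  # highest value `live` has reached

    def submit(self, script, arg):
        """Spawn one thread for one script; blocks while the cap is reached."""
        self._slots.acquire()
        result = {}

        def worker():
            with self._lock:
                self.live += 1
                self.peak = max(self.peak, self.live)
            try:
                result["value"] = script(arg)
            except Exception as exc:
                # A failing script must not take the daemon down with it.
                result["error"] = repr(exc)
            finally:
                with self._lock:
                    self.live -= 1
                self._slots.release()

        thread = threading.Thread(target=worker, daemon=True)
        thread.start()
        return thread, result
```

A caller would `submit()` each script, `join()` the returned thread, and read `result["value"]` or `result["error"]`. In the daemon proper, that result would be written back to MediaWiki over the socket instead.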
This model has the advantage for big installations that script processing can be compartmentalized and run only on certain systems or only on certain cores. It would also allow the scripting process to be more tightly sandboxed than PHP is, since it would only need to be able to call sbrk and read/write some sockets (see http://en.wikipedia.org/wiki/Seccomp ).
Another reason for using a narrow pipe interface is that it would make it possible to distinguish scripts which are a proper function of their inputs from ones that aren't, and it makes those limits easier to enforce.

For example, there could be three script modes:

  Function
  Function+Date
  Not-function
Functions are guaranteed to produce constant output for their input, and their input can't include anything which is more volatile than page editing. (i.e. no time/date as an input, no time/pid triggered rand(), no retrieving data from logs or other pages). The output from these could be trivially cached based on a hash of the input arguments.
Function+date scripts are like the above, but they also have access to the current date (but not the time). These could be cached, but the cache would be invalidated every day. This could be generalized further so that the script prototype specifies the available inputs. (i.e. is this a function on page-specific data, or is this just some formatting template which works universally?)
Not-function means without those limits.
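The caching rules for the three modes could look roughly like the following Python sketch. Everything here is an assumption for illustration: the names `cache_key` and `run_cached`, the mode strings, the in-process dictionary standing in for a real cache, and the requirement that arguments be JSON-serializable. Note that it folds the date into the key rather than explicitly invalidating entries each day, so yesterday's entries are simply never hit again and would still need to be evicted by some other means.

```python
import datetime
import hashlib
import json

_cache = {}  # key -> script output; stand-in for a real shared cache


def cache_key(script_name, args, today=None):
    """Hash the script name, its arguments and (optionally) the date."""
    parts = [script_name, args, today.isoformat() if today else None]
    payload = json.dumps(parts, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(script_name, script, args, mode, today=None):
    """Run `script` under one of the three modes, caching where allowed."""
    if mode == "function+date":
        # The date (not the time) is the only volatile input allowed.
        today = today or datetime.date.today()
        call = lambda: script(today=today, **args)
    else:
        today = None
        call = lambda: script(**args)

    if mode == "not-function":
        return call()  # no guarantees, so never cached

    key = cache_key(script_name, args, today)
    if key not in _cache:
        _cache[key] = call()
    return _cache[key]
```

A real key would presumably also need to cover the script's own source or revision, so that editing a script does not keep serving stale output.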
The different types of script could each have their own resource limits, execution priorities, and site policy controls. For example, Wikimedia might only allow function and function+revision_info for performance reasons.
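As a sketch of what such a per-mode site policy might look like (in Python, with every name and number invented for illustration, including `ModePolicy`, `SITE_POLICY`, and the specific limits):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModePolicy:
    allowed: bool           # may scripts of this mode run on this site?
    max_memory_bytes: int   # budget enforced by the executor
    max_cpu_seconds: float  # per-invocation time limit
    priority: int           # lower number is scheduled first


# Hypothetical policy for a performance-sensitive site: only the
# cacheable modes are enabled. All limits are placeholder values.
SITE_POLICY = {
    "function": ModePolicy(True, 8 * 2**20, 0.5, 0),
    "function+date": ModePolicy(True, 8 * 2**20, 0.5, 1),
    "not-function": ModePolicy(False, 4 * 2**20, 0.2, 2),
}


def check_allowed(mode, policy=SITE_POLICY):
    """Return the limits for `mode`, or refuse if the site disables it."""
    entry = policy.get(mode)
    if entry is None or not entry.allowed:
        raise PermissionError(f"script mode {mode!r} is not enabled on this site")
    return entry
```

The executor would call `check_allowed()` before spawning a script thread and apply the returned limits to it.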