On Tue, Jun 30, 2009 at 11:46 PM, Tim Starling<tstarling(a)wikimedia.org> wrote:
[snip]
SpiderMonkey and Python both lack control over memory usage. Python
lacks a sandbox mode, the rexec module has been removed. SpiderMonkey
isn't embedded in any useful kind of standalone, so you'd have to
start with a C development project, like you would for Lua.
CPython has about a billion ways to inject machine code; this is one
reason why Rpython failed.
If you were to do Python, it would probably need to be embedded in Java.
For SpiderMonkey, the model I would have envisioned is a separate
script executor daemon which spawns a thread per script (with limits to
keep the peak thread count reasonable) and arbitrates communication
with MediaWiki over sockets. Memory limits then become a simple
exercise in providing an instrumented malloc and setting the thread
stack size appropriately.
This model has the advantage for big installations that script
processing can be compartmentalized and run only on certain systems or
only on certain cores. It would also allow the scripting process to be
more tightly confined than PHP is, since it would only need to be able
to call sbrk and read/write some sockets. (See
http://en.wikipedia.org/wiki/Seccomp )
Another reason for using a narrow pipe interface is that it would make
it possible to distinguish scripts which are a proper function of
their inputs from ones that aren't, and it makes those limits easier
to enforce:
For example, there could be three script modes:
Function
Function+Date
Not-function
Functions are guaranteed to produce constant output for their input,
and their input can't include anything which is more volatile than
page editing. (i.e. no time/date as an input, no time/pid triggered
rand(), no retrieving data from logs or other pages). The output from
these could be trivially cached based on a hash of the input
arguments.
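To make the caching idea concrete, here is a minimal sketch in Python.
All names in it (run_function_script, execute, the in-memory dict) are
hypothetical, and it assumes the script text is hashed together with
the arguments so that editing a script also invalidates its cached
output.

```python
import hashlib
import json

# Hypothetical in-memory cache: maps a hash of the inputs to script output.
_cache = {}

def cache_key(script_source, args):
    """Hash the script text together with its (JSON-serializable) arguments."""
    payload = json.dumps({"script": script_source, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def run_function_script(script_source, args, execute):
    """Run a 'Function' mode script, reusing earlier output for equal inputs.

    `execute` stands in for whatever actually talks to the script daemon.
    """
    key = cache_key(script_source, args)
    if key not in _cache:
        _cache[key] = execute(script_source, args)
    return _cache[key]
```

Because a Function script cannot see anything more volatile than page
edits, no expiry logic is needed in this sketch.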
Function+date is like the above, but they also have access to the
current date (but not time). These could be cached but the cache would
be invalidated every day. This could be generalized further where the
script prototype could specify the available inputs. (i.e. is this a
function on page specific data, or is this just some formatting
template which works universally?)
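One way to get the daily invalidation without any explicit purge is to
fold the current date into the cache key. The sketch below assumes
that approach; the mode strings and the dated_cache_key name are made
up for illustration.

```python
import datetime
import hashlib
import json

def dated_cache_key(mode, args, today=None):
    """Return a cache key for a script call, or None if it must not be cached.

    `mode` is one of the hypothetical strings "function", "function+date"
    or "not-function"; `args` must be JSON-serializable.
    """
    parts = {"mode": mode, "args": args}
    if mode == "function+date":
        # The date (not the time) is an extra input, so the key changes daily.
        today = today or datetime.date.today()
        parts["date"] = today.isoformat()
    elif mode != "function":
        return None  # Not-function scripts have no cacheable key.
    payload = json.dumps(parts, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The same pattern would extend to other declared inputs: each input
named in the script prototype becomes another field in the hashed
payload.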
Not-function means without those limits.
The different types of script could each have their own resource
limits, execution priorities, and site policy controls. For example,
Wikimedia might only allow function and function+revision_info for
performance reasons.
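As a rough illustration of per-mode limits plus a site policy, consider
the sketch below. Every number and name in it (ModeLimits, LIMITS,
ALLOWED_MODES, limits_for) is an invented placeholder, not a proposal
for real values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModeLimits:
    max_memory_bytes: int
    max_cpu_seconds: float
    priority: int  # lower number = scheduled first (an assumption)

# Placeholder limits per script mode; the figures are purely illustrative.
LIMITS = {
    "function": ModeLimits(8 * 1024 * 1024, 1.0, 0),
    "function+revision_info": ModeLimits(8 * 1024 * 1024, 1.0, 1),
    "not-function": ModeLimits(4 * 1024 * 1024, 0.5, 2),
}

# Site policy: the modes this (hypothetical) installation permits.
ALLOWED_MODES = {"function", "function+revision_info"}

def limits_for(mode):
    """Return the limits for a mode, or refuse if site policy forbids it."""
    if mode not in ALLOWED_MODES:
        raise PermissionError("script mode not allowed here: " + mode)
    return LIMITS[mode]
```

A daemon could call limits_for() before spawning the script thread and
reject the request outright when the policy check fails.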