On Fri, Sep 29, 2017 at 6:48 AM, mathieu stumpf guntz <
psychoslave(a)culture-libre.org> wrote:
By the way is there an official policy or whatever document regarding
Scribunto evolutions?
Not that I know of. The biggest technical blocker to having Scribunto use a
newer version of Lua is that 5.2 heavily changed how function environments
work, so we'd have to redo the sandboxing and put it through a fresh
security review.
Ok, thank you. I guessed that each Scribunto process was hugely sandboxed,
especially as everything seems to be done to prevent passing information
between successive invocations of the same module. I hadn't thought of
possible side effects on PHP execution as explained in the ticket.
The problem with os.setlocale is that it applies to the whole process, not
just to the sandbox. With luastandalone that's less of an issue, since the
Lua code runs in a separate process (although we still don't start a new
process for each #invoke on the page). With the luasandbox PHP extension,
however, Lua shares the process with PHP.
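
To make the process-wide scope concrete, here is a rough analogy in Python
rather than actual Scribunto or luasandbox code. It assumes a worker thread is
a fair stand-in for sandboxed code that shares its host's process, and the
function name locale_seen_by_host is made up for this sketch.

```python
import locale
import threading


def locale_seen_by_host(category: int, name: str) -> str:
    """Set the locale from a worker thread, then report what the caller sees.

    The worker is only a stand-in (an assumption of this sketch) for code
    that shares its host's process.
    """
    worker = threading.Thread(target=locale.setlocale, args=(category, name))
    worker.start()
    worker.join()
    # Calling setlocale() without a locale name only queries the setting.
    return locale.setlocale(category)


if __name__ == "__main__":
    # The "C" locale is always available, so this request cannot fail.
    print(locale_seen_by_host(locale.LC_CTYPE, "C"))
```

The caller never sets the locale itself, yet it observes whatever the worker
chose, because the setting belongs to the process as a whole.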
Do we have some nice (or even ugly) schema of PHP/Scribunto execution
process so I could have a clearer representation of what's happening when I
grab a webpage of a mediawiki article with some Scribunto invocation?
Not really. When the parser processes the {{#invoke:}}, it calls
ScribuntoHooks::invokeHook(), which loads the invoked module, initializes
it, and then calls the invoked method.
But that's not the concern I was writing for. That is, I can't use unicode
identifiers as in `local plâtrière = préamorçage()`. When I see UTF-8
somewhere, I would expect no problem to use any glyph. So are my
expectations misguided, or is there something wrong with the way C.UTF-8 is
handled somewhere in the software stack?
Lua's processing operates on C chars (i.e. bytes), and uses C's isalpha()
and isalnum() to recognize which characters are "letters" for the purpose
of identifiers. For single-byte encodings this allows non-ASCII characters
such as 'â', 'è', 'é', and 'ç' to be recognized as "letters", hence the
documentation in Lua 5.1 about that, but in UTF-8 these are all represented
with multiple bytes so that doesn't work.
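
As a rough illustration of that byte-by-byte scan, here is a small Python
model. It is not Lua's actual lexer: the two classifier functions are
simplified stand-ins for what isalpha() reports in a Latin-1 locale and in a
UTF-8 locale, all names are invented for the sketch, and reserved words are
ignored.

```python
def latin1_isalpha(byte: int) -> bool:
    """Model of isalpha() in a Latin-1 locale: one byte is one character."""
    return bytes([byte]).decode("latin-1").isalpha()


def utf8_isalpha(byte: int) -> bool:
    """Model of isalpha() in a UTF-8 locale (assumption): a byte >= 0x80 is
    only a fragment of a multi-byte character, so it is never a letter."""
    return byte < 0x80 and chr(byte).isalpha()


def is_identifier(source: bytes, isalpha) -> bool:
    """Scan byte by byte: a letter or '_' first, then letters, digits, '_'."""
    if not source:
        return False
    underscore = ord("_")
    first, rest = source[0], source[1:]
    if not (isalpha(first) or first == underscore):
        return False
    return all(
        isalpha(b) or b == underscore or ord("0") <= b <= ord("9")
        for b in rest
    )


if __name__ == "__main__":
    name = "plâtrière"
    # One byte per character: 'â' and 'è' are classified as letters.
    print(is_identifier(name.encode("latin-1"), latin1_isalpha))
    # 'â' becomes the bytes 0xC3 0xA2, and neither one is a letter by itself.
    print(is_identifier(name.encode("utf-8"), utf8_isalpha))
```

The same name passes in the single-byte encoding and fails in UTF-8, even
though nothing about the scanning loop changed.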
Changing that would require rewriting all the Lua input processing to use
functions that can handle "wide" characters, which is well beyond what
we're at all likely to do. It'd have to happen upstream, and then we'd have
to spend the time to actually upgrade to Lua 5.4 or whatever version
implemented it. But since Lua 5.2 actually changed things the other way
("Lua identifiers cannot use locale-dependent letters",
https://www.lua.org/manual/5.2/manual.html#8.1) that too seems unlikely.
--
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation