On Fri, Sep 29, 2017 at 6:48 AM, mathieu stumpf guntz <psychoslave@culture-libre.org> wrote:
By the way, is there an official policy or other document regarding Scribunto's evolution?
Not that I know of. The biggest technical blocker to having Scribunto use a newer version of Lua is that 5.2 heavily changed how function environments work, so we'd have to redo the sandboxing and put it through a fresh security review.
Ok, thank you. I guessed that each Scribunto process was heavily sandboxed, especially as everything seems to be done to prevent passing information between successive invocations of the same module. I hadn't thought of possible side effects on PHP execution, as explained in the ticket.
The problem with os.setlocale is that it's global to the whole process, not confined to the sandbox. When using luastandalone that's less of an issue, since the Lua code runs in a separate process (although we still don't start a new process for each #invoke on the page). When running with the luasandbox PHP extension, though, Lua shares the PHP process.
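To illustrate what "global to the whole process" means, here is a minimal sketch in Python rather than Lua or PHP. It relies only on the fact that Python's locale.setlocale wraps the same C setlocale() call. The worker thread is a hypothetical stand-in for sandboxed code and is not how Scribunto is actually structured.

```python
import locale
import threading


def change_locale_in_worker() -> None:
    # Hypothetical stand-in for code inside one "sandbox" calling setlocale.
    locale.setlocale(locale.LC_CTYPE, "C")


worker = threading.Thread(target=change_locale_in_worker)
worker.start()
worker.join()

# Calling setlocale without a second argument only queries the current value.
# The change made in the worker is visible here, because the locale belongs
# to the process as a whole and not to the code that set it.
seen_by_main_thread = locale.setlocale(locale.LC_CTYPE)
print(seen_by_main_thread)
```

The same reasoning is why a separate process (as with luastandalone) limits the damage: the setting cannot leak across a process boundary.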
Do we have some nice (or even ugly) diagram of the PHP/Scribunto execution process, so I could have a clearer picture of what happens when I request a MediaWiki article page that contains a Scribunto invocation?
Not really. When the parser processes the {{#invoke:}}, it calls ScribuntoHooks::invokeHook() which loads the module invoked, initializes it, then calls the method invoked.
But that's not the concern I was writing about. That is, I can't use Unicode identifiers as in `local plâtrière = préamorçage()`. When I see UTF-8 somewhere, I would expect no problem using any glyph. So are my expectations misguided, or is there something wrong with the way C.UTF-8 is handled somewhere in the software stack?
Lua's processing operates on C chars (i.e. bytes), and uses C's isalpha() and isalnum() to recognize which characters are "letters" for the purpose of identifiers. For single-byte encodings this allows non-ASCII characters such as 'â', 'è', 'é', and 'ç' to be recognized as "letters", hence the documentation in Lua 5.1 about that, but in UTF-8 these are all represented with multiple bytes so that doesn't work.
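As a rough illustration of the byte-level point above, here is a small Python sketch. It is not Lua's actual lexer: `is_c_locale_identifier_byte` and `scan_identifier` are hypothetical helpers that only mimic what isalpha()/isalnum() accept in the plain "C" locale (plus the underscore), applied one byte at a time.

```python
def is_c_locale_identifier_byte(b: int, first: bool) -> bool:
    """Approximate isalpha()/isalnum() in the "C" locale, plus '_'."""
    if b >= 0x80:
        # No byte above 0x7F is a letter in the "C" locale.
        return False
    ch = chr(b)
    if ch == "_":
        return True
    # An identifier may not start with a digit.
    return ch.isalpha() if first else ch.isalnum()


def scan_identifier(source: bytes) -> bytes:
    """Return the longest leading run of bytes accepted as an identifier."""
    end = 0
    for i, b in enumerate(source):
        if not is_c_locale_identifier_byte(b, first=(i == 0)):
            break
        end = i + 1
    return source[:end]


name = "pl\u00e2tri\u00e8re"      # "plâtrière", written with escapes
utf8 = name.encode("utf-8")       # 'â' and 'è' take two bytes each
latin1 = name.encode("latin-1")   # 'â' and 'è' take one byte each

print(len(name), len(latin1), len(utf8))
print(scan_identifier(utf8))
```

In a single-byte locale such as Latin-1, the C library can report the one byte for 'â' as a letter, which is the case the Lua 5.1 documentation describes. In UTF-8 the lexer sees two separate bytes instead, and in this sketch the scan stops right after `pl`.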
Changing that would require rewriting all the Lua input processing to use functions that can handle "wide" characters, which is well beyond what we're at all likely to do. It'd have to happen upstream, and then we'd have to spend the time to actually upgrade to Lua 5.4 or whatever version implemented it. But since Lua 5.2 actually changed things the other way ("Lua identifiers cannot use locale-dependent letters", https://www.lua.org/manual/5.2/manual.html#8.1) that too seems unlikely.