Hey all!
While working on some maintenance scripts for TimedMediaHandler I've been trying to make it easier to write scripts that use multiple parallel processes to run through a large input set faster.
My proposal is a ForkableMaintenance class, with an underlying QueueingForkController which is a refactoring of the OrderedStreamingForkController used by (at least) some CirrusSearch maintenance scripts.
Patch in progress: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/451099/
The expected use case is a relatively long loop of reading input data interleaved with running CPU-intensive or DB-intensive jobs, where the individual jobs are independent and order of input is not strongly coupled. (Ordering of _running_ of items on the child processes is not guaranteed, but the order of result processing is guaranteed to be in the same order as input, making output across runs predictable for idempotent processing.)
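To illustrate the ordering guarantee, here is a minimal sketch of a reorder buffer: jobs may finish in any order, but results are only released once every earlier result has arrived. The OrderedResultBuffer class and its add() method are hypothetical names made up for this example; this is not how the patch itself is necessarily implemented.

```php
<?php
// Hypothetical helper, not part of the patch: releases results in input
// order even when the jobs themselves finish out of order.
class OrderedResultBuffer {
    /** @var array Finished results still waiting on earlier jobs */
    private $pending = [];
    /** @var int Sequence number of the next result to release */
    private $next = 0;

    /**
     * Record a finished job and return every result that can now be
     * released, in input order.
     */
    public function add( int $seq, $result ): array {
        $this->pending[$seq] = $result;
        $ready = [];
        while ( array_key_exists( $this->next, $this->pending ) ) {
            $ready[] = $this->pending[$this->next];
            unset( $this->pending[$this->next] );
            $this->next++;
        }
        return $ready;
    }
}

$buffer = new OrderedResultBuffer();
$released = [];
// Jobs 0..3 happen to finish in the order 2, 0, 1, 3.
foreach ( [ 2 => 'c', 0 => 'a', 1 => 'b', 3 => 'd' ] as $seq => $result ) {
    foreach ( $buffer->add( $seq, $result ) as $item ) {
        $released[] = $item;
    }
}
echo implode( ',', $released ), "\n"; // prints "a,b,c,d"
```

The cost of this approach is that one slow job holds back every later result until it completes, which is why it suits loops where the jobs take roughly similar time.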
A simple ForkableMaintenance script might look like:
class Foo extends ForkableMaintenance {
    // Handle any input on the parent thread, and
    // pass any data as JSON-serializable form into
    // the queue() method, where it gets funneled into
    // a child process.
    public function loop() {
        for ( $i = 0; $i < 1000; $i++ ) {
            $this->queue( $i );
        }
    }

    // On the child process, receives the queued value
    // via JSON encode/decode. Here it's a number.
    public function work( $count ) {
        return str_repeat( '*', $count );
    }

    // On the parent thread, receives the work() return value
    // via JSON encode/decode. Here it's a string.
    public function result( $data ) {
        $this->output( $data . "\n" );
    }
}
Because data is serialized as JSON and sent over a pipe, you can't send live objects like Titles or Pages or Files, but you can send arrays or associative arrays with fairly complex data.
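As a rough sketch of what that means in practice, the example below sends the plain values needed to rebuild an object rather than the object itself. The payload keys ('namespace', 'dbkey', 'languages') are made up for illustration, and decoding with the associative flag is an assumption about how the pipe data gets unpacked.

```php
<?php
// Hypothetical payload: plain data standing in for a live Title object.
$job = [
    'namespace' => 6, // NS_FILE
    'dbkey' => 'Example.webm',
    'languages' => [ 'en', 'fr' ],
];

// What crosses the pipe is a JSON string, so only plain data survives.
$wire = json_encode( $job );
$received = json_decode( $wire, true );

// Scalars and nested arrays round-trip unchanged.
var_dump( $received === $job ); // bool(true)
```

On the child side, work() could then rebuild whatever it needs from those values, for example with Title::makeTitle( $received['namespace'], $received['dbkey'] ).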
There is a little per-job overhead, and multiple threads can cause more contention on the database server etc., but it scales well on the subtitle format conversion script I'm testing with, which does a little DB loading and some CPU work. On an 8-core/16-thread test server:
threads  runtime (s)  speedup
      0          320  n/a
      1          324  0.99
      2          183  1.75
      4          105  3.05
      8           66  4.85
     16           58  5.52
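For reference, the speedup column is the 0-thread runtime divided by each runtime. The short sketch below recomputes it from the numbers in the table and also derives a per-thread efficiency (speedup divided by thread count), which is an extra figure added here for illustration and not something measured separately.

```php
<?php
// Runtimes in seconds from the table above, keyed by thread count.
$runtimes = [ 1 => 324, 2 => 183, 4 => 105, 8 => 66, 16 => 58 ];
$baseline = 320; // the 0-thread (non-forking) run

$speedups = [];
foreach ( $runtimes as $threads => $seconds ) {
    $speedups[$threads] = $baseline / $seconds;
    // Efficiency: how much of an ideal linear speedup each thread delivers.
    $efficiency = $speedups[$threads] / $threads;
    printf(
        "%2d threads: speedup %.2f, efficiency %.0f%%\n",
        $threads,
        $speedups[$threads],
        100 * $efficiency
    );
}
```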
I've added a phpunit test case for OrderedStreamingForkController to make sure I don't regress something used by other teams, but noticed a couple problems with using this fork stuff in the test runners.
First, doing pcntl_fork() inside a phpunit test case has some potential side effects, since there's a lot of tear-down work done in destructors even if you call exit() after processing completes. As a workaround, I'm having the child procs send a SIGKILL to themselves to terminate immediately without running test-case destructors.
Second, probably related to that, I'm seeing a failure in the code coverage calculations -- it's seeing some increased coverage on the parent process at least, but seems to think it's returning a non-zero exit code somewhere, which marks the whole operation as a failure:
https://integration.wikimedia.org/ci/job/mediawiki-phpunit-coverage-patch-do...
Worst case, we can probably exclude these from some automated tests, but that always seems like a bad idea. :D
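For anyone who wants to poke at the self-SIGKILL workaround from the first point, here is a minimal standalone sketch. It assumes the pcntl and posix extensions are loaded (CLI on a Unix-like system); NoisyTearDown and runChildAndKill() are hypothetical names for this example and are not taken from the patch.

```php
<?php
// Sketch only: needs the pcntl and posix extensions (CLI, Unix-like OS).
// Stand-in for the tear-down work a test case might do in a destructor.
class NoisyTearDown {
    public function __destruct() {
        echo "tear-down ran in pid " . getmypid() . "\n";
    }
}

// Returns the signal that ended the child, 0 if it exited normally,
// or null if forking is unavailable or failed.
function runChildAndKill(): ?int {
    if ( !function_exists( 'pcntl_fork' ) || !function_exists( 'posix_kill' ) ) {
        return null;
    }
    $guard = new NoisyTearDown();
    $pid = pcntl_fork();
    if ( $pid === -1 ) {
        return null; // fork failed
    }
    if ( $pid === 0 ) {
        // Child: terminate immediately, skipping destructors and
        // shutdown functions. A plain exit() would run both.
        posix_kill( posix_getpid(), SIGKILL );
        exit( 1 ); // fallback, not reached if the signal was delivered
    }
    // Parent: reap the child and report which signal ended it.
    pcntl_waitpid( $pid, $status );
    return pcntl_wifsignaled( $status ) ? pcntl_wtermsig( $status ) : 0;
}

$signal = runChildAndKill();
```

The tear-down message should appear once, from the parent only. Note that the parent sees the child as terminated by a signal rather than as having exited with status 0, which may be worth keeping in mind when looking at how the test runner interprets child exits.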
If anybody else is using, or thinking about using, ForkController and its friends and wants to help polish this up, give a shout!
-- brion