Thank you, Sage, for your reply:
... I've been chatting with the folks working on this, and they are actually quite close to having a usable API for estimated article quality, which I'm super excited about building into our dashboard. The human part of it will come down the road a bit; the main purpose there will be to continually improve the model by having experienced editors create good ratings data for training it. But I expect there won't be much trouble finding Wikipedians to pitch in on that.
I had actually been exploring the idea of setting up a crowdsourcing system where we might pay experienced editors to do before-and-after ratings of student work, but at this point I'm much more enthusiastic about the machine learning approach that the revision-scoring-as-a-service project is taking, since it is easier to scale up and maintain over the long term.
I recommend measuring the optimal amount of human input and review; it is very substantially nonzero if you want to maximize the encyclopedia's utility function. Is there really nobody at the WEF who wants to try to co-mentor accuracy review? What if there were a cap on the total hours needed? I am sure you wouldn't regret it, but I am also happy to continue on my own for the time being.
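To make the measurement idea concrete, here is a minimal sketch of what I have in mind: compare machine quality scores against a small set of human ratings of the same revisions and see how far apart they are. The endpoint URL, response shape, and sample ratings below are purely my own illustrative assumptions, not the revision-scoring project's actual API.

```python
# Illustrative sketch only: compares hypothetical machine quality scores with a
# small sample of human ratings to gauge how much human review still adds.
import requests

QUALITY_API = "https://example.org/scores"  # hypothetical quality-scoring endpoint


def machine_score(rev_id):
    """Fetch a predicted quality score (0-1) for a revision; response shape is assumed."""
    resp = requests.get(QUALITY_API, params={"revid": rev_id}, timeout=10)
    resp.raise_for_status()
    return resp.json()["score"]


def mean_disagreement(human_ratings):
    """Average absolute gap between human ratings and machine scores (same 0-1 scale).

    A large gap suggests human review is still contributing information the model lacks;
    a small gap suggests the model can carry more of the load.
    """
    gaps = [abs(rating - machine_score(rev_id)) for rev_id, rating in human_ratings.items()]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    # Hypothetical before/after ratings from experienced editors for a few student articles.
    sample = {12345678: 0.42, 12345910: 0.71, 12346001: 0.55}
    print(f"Mean human/machine disagreement: {mean_disagreement(sample):.2f}")
```

Tracking that disagreement over time would give a defensible, bounded estimate of how many reviewer hours are actually worth funding, which is all the cap I am asking about.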
Best regards, James