The (rough) epic definition is already on Phabricator:
https://phabricator.wikimedia.org/T98970
I've defined some metrics there already, but admittedly—and thanks for calling us out on this—we don't really have baselines*. I think there are some feasible ways to get a rough starting point, which I can brainstorm w/ the team. We were planning (or I should say, I was hoping) to gather more code metrics anyway, so I'm glad to have an excuse to hook it up sooner ;-).
FWIW, I think having patches tested as part of code review would also work as a sufficient definition of success. Our goal here is to do that as quickly, easily, and cheaply as possible so we can get back to focusing on the app.
* I think it's fair to say that coverage at the point of migration was already low (~10%, based on my Travis-covered fork) and hasn't changed much since.