The result will be not only better tests, but also greater impact for users, because we will actually be able to deploy the improvements we have worked on.
That is the hope and promise of the relevance lab. And it can test more than the zero-results rate (ZRR).
We already have the ability to check the rate of change in the top-n results, either ignoring order or not. So you can see whether anything breaks into the top 5 results (ignoring shuffling among the top 5), or note any change at all in the top 5 (including reshuffling). Mechanically, you can see that, for example, a change affects the top 5 results for only 0.1% of your query corpus, so regardless of whether it is good or bad, it isn't super high impact.
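To make that concrete, here is a minimal sketch of what such a top-n change metric might look like. The function name and inputs are hypothetical, not anything from the relevance lab itself: assume "before" and "after" map each query to its ranked list of result IDs under the old and new configurations.

    # Minimal sketch: fraction of queries whose top-n results change.
    # "before" and "after" are hypothetical dicts mapping each query
    # to its ranked list of result IDs under the old and new configs.
    def topn_change_rate(before, after, n=5, ordered=False):
        changed = 0
        for query, old_results in before.items():
            old_top = old_results[:n]
            new_top = after[query][:n]
            if ordered:
                # any change at all, including reshuffling
                differs = old_top != new_top
            else:
                # only membership changes; reshuffling is ignored
                differs = set(old_top) != set(new_top)
            if differs:
                changed += 1
        return changed / len(before) if before else 0.0

    # e.g., topn_change_rate(before, after, n=5, ordered=False) == 0.001
    # would mean only 0.1% of the corpus sees new top-5 results.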
The relevance lab report also provides links to diffs of example queries that are affected by a change, so you can review them to get a sense of whether they are good or bad. It's subjective, but you can sometimes get a rough sense of things by looking at a few dozen randomly chosen examples. If you look at 25 examples and 90% of them are clearly worse because of the change, you know you need to fix something, even with the giant confidence interval such a small sample entails.
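For what it's worth, you can put a number on how giant that confidence interval is. Here's a quick sketch using a Wilson score interval, which is a standard formula for a binomial proportion, not anything built into the relevance lab:

    from math import sqrt

    def wilson_interval(p_hat, n, z=1.96):
        # 95% Wilson score interval for a binomial proportion
        denom = 1 + z * z / n
        center = (p_hat + z * z / (2 * n)) / denom
        half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
        return center - half, center + half

    # 90% of 25 examples judged worse:
    low, high = wilson_interval(0.9, 25)
    # low is about 0.73 and high about 0.97, so even the bottom of
    # the interval says roughly three quarters of results got worse.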
So, even without a gold standard corpus of graded search results, you can use the relevance lab to do some pre-testing of a change and get a sense of whether it's doing what you want on general queries (and not just the handful you were focused on trying to fix).
You can also test impact and effectiveness on a focused query corpus of rarer query types (e.g., queries of more than 10 words).
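Building such a focused corpus can be as simple as filtering a query log. A trivial sketch, assuming a hypothetical plain-text log with one query per line (the file names here are made up):

    # Pull out the rarer query type mentioned above: queries of
    # more than 10 words.
    with open("query_log.txt", encoding="utf-8") as src, \
         open("long_queries.txt", "w", encoding="utf-8") as dst:
        for line in src:
            query = line.strip()
            if len(query.split()) > 10:
                dst.write(query + "\n")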
And adding other metrics is pretty straightforward, if anyone has ideas for other places we should watch for changes. Also, on the back burner, I have some ideas for improvements that look at other annotations on a query and how to incorporate/test those in the relevance lab.
So, I agree with you on the relevance lab side. I'm also looking forward to better user acceptance testing, whether through more complex click-stream metrics or micro surveys or whatever else works. We collectively suffer from the curse of knowledge when it comes to search—it's hard to know what users who don't spend a non-trivial portion of their professional lives contemplating search will really like/want/use.
—Trey