The result will be not only better tests, but a better impact for
users, because we will actually be able to deploy the improvements we
have worked on

That is the hope and promise of the relevance lab. And it can test more than the ZRR.

We already have the ability to check the rate of change to the top n results (ignoring order or not). So you can see whether anything breaks into the top 5 results (ignoring shuffling among the top 5), or note any change at all in the top 5 (including reshuffling). Mechanically, you can see that, for example, a change affects the top 5 results for only 0.1% of your query corpus, so regardless of whether it is good or bad, it isn't super high impact.
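
To make that concrete, here is a minimal sketch of how such a top-5 change rate could be computed. It's an illustration under my own assumptions (two runs represented as dicts from query string to ranked list of result titles), not the relevance lab's actual code:

    # Hypothetical sketch of the top-n change-rate idea. Assumes each run is
    # a dict mapping a query string to an ordered list of result page titles.
    def top_n_change_rates(baseline, candidate, n=5):
        """Return (any_change_rate, set_change_rate) over shared queries."""
        queries = baseline.keys() & candidate.keys()
        set_changes = 0   # something broke into or out of the top n
        any_changes = 0   # any difference at all, including reshuffling
        for q in queries:
            base_top = baseline[q][:n]
            cand_top = candidate[q][:n]
            if set(base_top) != set(cand_top):
                set_changes += 1
            if base_top != cand_top:
                any_changes += 1
        total = len(queries) or 1
        return any_changes / total, set_changes / total

The set-change rate is the kind of number behind the "affects the top 5 for 0.1% of the corpus" example above.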

The relevance lab report also provides links to diffs of example queries that are affected by a change, so you can review them to get a sense of whether they are good or bad. It's subjective, but you can sometimes get a rough sense of things by looking at a few dozen randomly chosen examples. If you look at 25 examples and 90% of them are clearly worse because of the change, you know you need to fix something, even with the giant confidence interval such a small sample entails.
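
If you want to put a number on how giant that interval is, a quick Wilson score calculation does the job (plain Python, just showing the arithmetic, nothing relevance-lab-specific):

    import math

    def wilson_interval(p_hat, n, z=1.96):
        """Approximate 95% Wilson score interval for an observed proportion."""
        denom = 1 + z * z / n
        center = (p_hat + z * z / (2 * n)) / denom
        spread = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n
                                         + z * z / (4 * n * n))
        return center - spread, center + spread

    # 90% "worse" out of 25 examples: roughly (0.72, 0.97). Wide, but the
    # lower bound is still well above 50%, so "go fix something" holds.
    print(wilson_interval(0.90, 25))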

So, even without a gold standard corpus of graded search results, you can use the relevance lab to do some pre-testing of a change and get a sense of whether it's doing what you want on general queries (and not just the handful you were focused on trying to fix).

You can also test impact and effectiveness on a focused query corpus of rarer query types (e.g., queries of more than 10 words).
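
As a throwaway illustration of building such a corpus (the file names and the one-query-per-line format are just assumptions on my part, not how the relevance lab stores queries):

    # Hypothetical sketch: pull long queries (more than 10 words) out of a
    # plain-text dump with one query per line, for use as a focused corpus.
    def long_queries(path, min_words=11):
        with open(path, encoding="utf-8") as f:
            for line in f:
                query = line.strip()
                if query and len(query.split()) >= min_words:
                    yield query

    with open("long_queries.txt", "w", encoding="utf-8") as out:
        for q in long_queries("all_queries.txt"):
            out.write(q + "\n")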

And adding other metrics is pretty straightforward, if anyone has ideas for other places where we should take note of changes. I also have some ideas on the back burner for improvements that look at other annotations on a query, and for how to incorporate and test those in the relevance lab.

So, I agree with you on the relevance lab side. I'm also looking forward to better user acceptance testing, whether through more complex click-stream metrics or micro surveys or whatever else works. We collectively suffer from the curse of knowledge when it comes to search—it's hard to know what users who don't spend a non-trivial portion of their professional lives contemplating search will really like/want/use.

—Trey
 
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation


On Thu, Dec 10, 2015 at 10:39 AM, Oliver Keyes <okeyes@wikimedia.org> wrote:
The title is mostly to get your attention; you know I like A/B
testing. With that being said:

For a quarter and a bit we've been running A/B tests. Doing so has
been intensely time-consuming for both engineering and analysis, and
at times it's felt like we're pushing changes out just to test them,
rather than because we have reason to believe there will be dramatic
improvements.

These tests have produced, at best, mixed results. Many of the tests
have not shown a substantial improvement in the metric we have been
testing - the zero results rate. Those that have shown an improvement
have not been deployed further, because we cannot, from the ZRR alone,
test the _utility_ of the produced results: for that we need to A/B
test against clickthroughs or a satisfaction metric.

So where do we go from here?

In my mind, the ideal is that we stop A/B testing against the zero results rate.

This doesn't mean we stop testing improvements: this means we build
the relevance lab up and out and test the zero results rate against
/that/. ZRR does not need user participation; it needs the
participation of user *queries*: with the relevance lab we can consume
user queries and test ideas against them at a fraction of the cost of
a full A/B test.

Instead, we use the A/B tests for the other component: the utility
component. If something passes the Relevance Lab ring of fire, we A/B
test it against clickthroughs: this will be rarer than "every two
weeks" and so we can afford to spend some time making sure the test is
A+ scientifically, and all our ducks are in a row.

The result will be not only better tests, but a better impact for
users, because we will actually be able to deploy the improvements we
have worked on - something that has thus far escaped us due to
attention being focused on deploying More Tests rather than completely
validating the ones we have already deployed.

Thoughts?

--
Oliver Keyes
Count Logula
Wikimedia Foundation

_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery