One possible way to give people the context they need to answer the
question accurately is to provide them with, say, three of the top search
queries that you think are relevant to the result, and ask them to choose
which one is *most* relevant.
That might be less confusing, but unfortunately I don't think it would give
us what we want. In this scenario, we'd need up/down votes on all three
options, and relative ranking among them wouldn't be useful. (I can give an
example to explain if that's not clear.)
I agree this falls under (or is at least reasonably similar to)
experimental design, though, and it'd be great to get help.
(While this was Erik's excellent idea, I'm very excited about it because it
would mean I could stop feeling guilty about not having done any
Discernatron queries in months.)
On Thu, May 4, 2017 at 7:07 PM, Jonathan Morgan <jmorgan(a)wikimedia.org>
wrote:
> This conversation is exactly what I meant by "experimental design" above.
>
> I like Jan's recommendation to keep the prompt simple, and ask people
> to provide a quick binary judgement. However, I agree that, since some of
> the search queries you're showing folks are going to be kind of oddball,
> you want to give them a little bit of context to help them understand that
> they're looking at a search query.
>
> Without some context, I'm not sure I would be able to give an accurate
> answer to the question "Is this article about 'hydrostone halifax nova
> scotia'?"
>
> Seeing multiple examples makes decision-making easier. The prompt could be
> something like "Which set of [search terms/key words/tags] is most relevant
> to this article?"
>
> Adding a "none of the above" option as well would allow you to screen out
> cases where the responder was either confused by the question, or felt that
> none of the candidate queries were even remotely relevant.
>
> I suggest you loop in Aeryn Palmer from Legal, and add a "why are we
> asking this?" link into the banner/quicksurvey popup that links to a survey
> privacy statement page on FoundationWiki
> <https://wikimediafoundation.org/wiki/Quick_Survey_Privacy_Statement>.
>
> Hope that helps,
> J
>
> On Thu, May 4, 2017 at 12:32 PM, Trey Jones <tjones(a)wikimedia.org> wrote:
>
>> Yeah, this is definitely the reverse of Discernatron. Part of the reason
>> for waiting 60s is that then, hopefully, the reader at least has some idea
>> what the article is about (another difficulty with Discernatron), so they
>> only have to spend a little time guessing what the query is about.
>>
>> We are going to have to work on the wording of the question. It needs to
>> be clear and concise.
>>
>> I worry that *Is this page about "X"?* might make people reply too
>> strictly. A page can be reasonably relevant to X without being *about*
>> X. What about this: *If you searched for X, would this article be a good
>> result?* Though I'm not sure normal people think in terms of "results".
>>
>> - *Would someone who searched for X want to read this article?*—better
>> - *If someone searched for X, would they want to read this article?*—longer,
>> but easier to parse.
>> - *If someone searched for X, would they find what they are looking
>> for in this article?*—probably too long
>>
>> More brainstorming on this wouldn't hurt, even if it is very early in the
>> whole process.
>>
>> There's also the wording that goes with the request for a judgement.
>> "Help us make search better!" might get more response than just the
>> judgement question.
>>
>> Folks in fundraising might have good ideas about how to catch people's
>> attention, and at the very least we could learn from them and actively
>> A/B test different options to see what kind of response rate we get.
>>
>> We might also get cleaner A/B test results if we limited their scope—a
>> few pages and a few "queries" where we know the answers, so we can gauge
>> not only response rate, but also engagement, to see if one kind of phrasing
>> makes people try a little harder.
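
A minimal sketch of how such a scoped test could be tallied, assuming
hypothetical event records that carry the phrasing variant, the (query,
page) pair shown, and the reader's answer (None for a dismissal); all
field names below are illustrative, not an existing schema:

    from collections import defaultdict

    def summarize(events, known_answers):
        """events: dicts like {"variant": "A", "query": "...", "page": "...",
        "answer": "yes" / "no" / None}; known_answers maps (query, page) to
        the expected "yes"/"no"."""
        stats = defaultdict(lambda: {"shown": 0, "answered": 0, "correct": 0})
        for e in events:
            s = stats[e["variant"]]
            s["shown"] += 1
            if e["answer"] is not None:
                s["answered"] += 1
                if e["answer"] == known_answers[(e["query"], e["page"])]:
                    s["correct"] += 1
        for variant, s in sorted(stats.items()):
            rate = s["answered"] / s["shown"] if s["shown"] else 0.0
            acc = s["correct"] / s["answered"] if s["answered"] else 0.0
            print(f"{variant}: response rate {rate:.1%}, agreement {acc:.1%}")

Comparing response rate against agreement-with-known-answers per variant
would show whether a phrasing that attracts more clicks also attracts
more careless ones.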
>>
>> We might also want to make "No, thanks" the default button so that it is
>> easier to bail than to give random input.
>>
>> Trey Jones
>> Software Engineer, Discovery
>> Wikimedia Foundation
>>
>> On Thu, May 4, 2017 at 2:44 PM, Jan Drewniak <jdrewniak(a)wikimedia.org>
>> wrote:
>>
>>> Hi Erik
>>>
>>> From my understanding, it looks like you're looking to collect relevance
>>> data "in reverse". Typically, for this type of data collection, I would
>>> assume that you'd present a query with some search results, and ask users
>>> "which results are relevant to this query?" (which is what Discernatron
>>> does, at a very high effort level).
>>>
>>> What I think you're proposing instead is that when a user visits an
>>> article, we present them with a question that asks "would this search
>>> query be relevant to the article you are looking at?".
>>>
>>> I can see this working, provided that the query is controlled and the
>>> question is *not* phrased like it is above.
>>>
>>> I think that for this to work, the question should be phrased in a way
>>> that elicits a simple "top-level" (maybe "yes" or "no") response. For
>>> example, the question "*is this page about*: 'hydrostone halifax nova
>>> scotia'" can be responded to with a thumbs up 👍 or thumbs down 👎, but a
>>> question like "is this article relevant to the following query: ..." seems
>>> more complicated 🤔.
>>>
>>>
>>> On Thu, May 4, 2017 at 6:29 PM, Erik Bernhardson <ebernhardson(a)wikimedia.org> wrote:
>>>
>>>> On Wed, May 3, 2017 at 12:44 PM, Jonathan Morgan <jmorgan(a)wikimedia.org> wrote:
>>>>
>>>>> Hi Erik,
>>>>>
>>>>> I've been using some similar methods to evaluate Related Article
>>>>> recommendations
>>>>> <https://meta.wikimedia.org/wiki/Research:Evaluating_RelatedArticles_recommendations>
>>>>> and the source of the trending article card
>>>>> <https://meta.wikimedia.org/wiki/Research:Comparing_most_read_and_trending_edits_for_Top_Articles_feature>
>>>>> in the Explore feed on Android. Let me know if you'd like to sit down
>>>>> and chat about experimental design sometime.
>>>>>
>>>>> - J
>>>>>
>>>>>
>>>> This might be useful. I'll see if I can find a time on both our
>>>> calendars. I should note, though, that this is explicitly not about
>>>> experimental design. The data is not going to be used for experimental
>>>> purposes, but rather to feed into a machine learning pipeline that will
>>>> re-order search results to provide the best results at the top of the
>>>> list. For the purpose of ensuring the long tail is represented in the
>>>> training data for this model, I would like to have a few tens of
>>>> thousands of labels for (query, page) combinations each month. The
>>>> relevance of pages to a query does have some temporal aspect, so we
>>>> would likely want to only use the last N months' worth of data (TBD).
>>>>
>>>>> On Wed, May 3, 2017 at 12:24 PM, Erik Bernhardson <ebernhardson(a)wikimedia.org> wrote:
>>>>>
>>>>>> At our weekly relevance meeting an interesting idea came up about how
>>>>>> to collect relevance judgements for the long tail of queries, which
>>>>>> make up around 60% of search sessions.
>>>>>>
>>>>>> We are pondering asking questions on the article pages themselves.
>>>>>> Roughly, we would manually curate some list of queries we want to
>>>>>> collect relevance judgements for. When a user has spent some threshold
>>>>>> of time (60s?) on a page we would, for some % of users, check whether
>>>>>> we have any queries we want labeled for this page, and then ask them
>>>>>> if the page is a relevant result for that query. In this way the
>>>>>> amount of work asked of individuals is relatively low, and hopefully
>>>>>> it is something they can answer without much effort. We know that the
>>>>>> average page receives a few thousand page views per day, so even with
>>>>>> a relatively low response rate we could probably collect a reasonable
>>>>>> number of judgements over some medium-length time period (weeks?).
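
To make the proposed mechanism concrete, here is a minimal sketch of the
triggering logic; the dwell threshold, sampling rate, query store, and
question wording are all illustrative placeholders rather than decided
values:

    import random
    from typing import Optional

    # Hypothetical curated store: page title -> queries needing judgements.
    PENDING_QUERIES = {"Hydrostone": ["hydrostone halifax nova scotia"]}
    DWELL_THRESHOLD_S = 60   # only ask once the reader knows the article
    SAMPLE_RATE = 0.01       # fraction of eligible page views that get asked

    def maybe_ask(page_title: str, dwell_seconds: float) -> Optional[str]:
        """Return a yes/no survey question for this page view, or None."""
        if dwell_seconds < DWELL_THRESHOLD_S:
            return None
        if random.random() > SAMPLE_RATE:
            return None
        queries = PENDING_QUERIES.get(page_title)
        if not queries:
            return None
        query = random.choice(queries)
        return f'Would someone who searched for "{query}" want to read this article?'

Tuning SAMPLE_RATE per page traffic would keep the ask rare enough to
stay unobtrusive while still accumulating judgements over weeks.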
>>>>>>
>>>>>> These labels would almost certainly be noisy; we would need to
>>>>>> collect the same judgement many times to get any kind of certainty on
>>>>>> the label. Additionally, we would not really be able to explain the
>>>>>> nuances of a grading scale with many points, so we would probably have
>>>>>> to use either a thumbs up/thumbs down approach, or maybe a
>>>>>> happy/sad/indifferent smiley face.
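
One possible way to collapse the repeated, noisy thumbs into a usable
binary label is to require a minimum number of votes and a confident
majority, for example via a Wilson score lower bound; the thresholds
below are assumptions, not settled values:

    import math

    MIN_VOTES = 10   # assumed minimum before we trust a label at all
    Z = 1.96         # ~95% Wilson interval

    def wilson_lower_bound(successes: int, total: int, z: float = Z) -> float:
        """Lower bound of the Wilson score interval for a binomial rate."""
        if total == 0:
            return 0.0
        p = successes / total
        centre = p + z * z / (2 * total)
        margin = z * math.sqrt((p * (1 - p) + z * z / (4 * total)) / total)
        return (centre - margin) / (1 + z * z / total)

    def label(up: int, down: int) -> str:
        """Collapse repeated up/down judgements into a binary label."""
        total = up + down
        if total < MIN_VOTES:
            return "undecided"   # keep collecting judgements
        if wilson_lower_bound(up, total) > 0.5:
            return "relevant"
        if wilson_lower_bound(down, total) > 0.5:
            return "not relevant"
        return "undecided"

The lower bound keeps a 6-to-4 split "undecided" while letting a
lopsided 9-to-1 result resolve, which matches the stated need to see the
same judgement many times before trusting it.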
>>>>>>
>>>>>> Does this seem reasonable? Are there other ways we could go about
>>>>>> collecting the same data? How do we design it in a non-intrusive
>>>>>> manner that gets results but doesn't annoy users? Other thoughts?
>>>>>>
>>>>>>
>>>>>> For some background:
>>>>>>
>>>>>> * We are currently generating labeled data using statistical analysis
>>>>>> (clickmodels) against historical click data. This analysis requires
>>>>>> there to be multiple search sessions with the same query presented
>>>>>> with similar results to estimate the relevance of those results. A
>>>>>> manual review of the results showed queries with clicks from at least
>>>>>> 10 sessions had reasonable but not great labels, queries with 35+
>>>>>> sessions looked pretty good, and queries with hundreds of sessions
>>>>>> were labeled really well. (A toy sketch of this kind of model follows
>>>>>> the list below.)
>>>>>>
>>>>>> * An analysis of 80 days' worth of search click logs showed that 35
>>>>>> to 40% of search sessions are for queries that are repeated more than
>>>>>> 10 times in that 80 day period. Around 20% of search sessions are for
>>>>>> queries that are repeated more than 35 times in that 80 day period.
>>>>>> (https://phabricator.wikimedia.org/P5371)
>>>>>>
>>>>>> * Our privacy policy prevents us from keeping more than 90 days'
>>>>>> worth of data from which to run these clickmodels. Practically, 80
>>>>>> days is probably a reasonable cutoff, as we will want to re-use the
>>>>>> data multiple times before needing to delete it and generate a new
>>>>>> set of labels.
>>>>>>
>>>>>> * We currently collect human relevance judgements with Discernatron
>>>>>> (https://discernatron.wmflabs.org/). This is useful data for manual
>>>>>> evaluation of changes, but the data set is much too small (low
>>>>>> hundreds of queries, with an average of 50 documents per query) to
>>>>>> integrate into machine learning. The process of judging query/document
>>>>>> pairs is quite tedious for the community, and it doesn't seem like a
>>>>>> great use of engineer time for us to do this ourselves.
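
The toy sketch promised in the clickmodels bullet: a simplified
position-based model that assumes P(click) = P(examined at rank) *
relevance. The examination probabilities below are made-up placeholders;
real clickmodels (DBN, UBM, etc.) estimate them jointly from the logs,
so treat this only as an illustration of why repeated sessions matter:

    from collections import defaultdict

    # Toy position-based click model; all numbers are illustrative.
    EXAMINE = {1: 0.68, 2: 0.47, 3: 0.35, 4: 0.28, 5: 0.23}

    def estimate_relevance(impressions):
        """impressions: (query, page_id, rank, clicked) tuples from click logs."""
        exam_mass = defaultdict(float)  # expected examinations per (query, page)
        clicks = defaultdict(float)
        for query, page_id, rank, clicked in impressions:
            exam_mass[(query, page_id)] += EXAMINE.get(rank, 0.1)
            clicks[(query, page_id)] += 1.0 if clicked else 0.0
        # relevance ~ clicks / expected examinations; this is noisy for rare
        # queries, which is why the sessions-per-query thresholds above matter.
        return {k: clicks[k] / exam_mass[k] for k in exam_mass}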
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jonathan T. Morgan
>>>>> Senior Design Researcher
>>>>> Wikimedia Foundation
>>>>> User:Jmorgan (WMF)
>>>>> <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Jan Drewniak
>>> UX Engineer, Discovery
>>> Wikimedia Foundation
>>>
>>
>
>
> --
> Jonathan T. Morgan
> Senior Design Researcher
> Wikimedia Foundation
> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>
>