At our weekly relevance meeting, an interesting idea came up about how to collect relevance judgements for the long tail of queries, which makes up around 60% of search sessions.
We are pondering asking questions on the article pages themselves. Roughly, we would manually curate a list of queries we want to collect relevance judgements for. When a user has spent some threshold of time (60s?) on a page, we would, for some percentage of users, check whether we have any queries we want labeled for that page, and then ask them if the page is a relevant result for that query. In this way the amount of work asked of each individual is relatively low, and hopefully it is something they can answer without much effort. We know that the average page receives a few thousand page views per day, so even with a relatively low response rate we could probably collect a reasonable number of judgements over some medium-length time period (weeks?).
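To make the mechanics concrete, here is a very rough sketch of the eligibility check I have in mind; the dwell threshold, sampling rate, and curated-list format are all placeholders, not design decisions:

    # Rough sketch of the prompt-eligibility check, not a real implementation.
    # All names and numbers (the dwell threshold, the sampling rate, the
    # curated query list) are placeholders for illustration only.
    import random

    DWELL_THRESHOLD_SECONDS = 60   # "has the reader actually engaged?"
    SAMPLING_RATE = 0.01           # fraction of eligible page views we prompt

    # hypothetical curated list: page title -> queries we want judged for it
    QUERIES_WANTING_LABELS = {
        "Example_page": ["some long tail query", "another rare query"],
    }

    def pick_question(page_title, dwell_seconds):
        """Return a (query, page_title) pair to ask about, or None."""
        if dwell_seconds < DWELL_THRESHOLD_SECONDS:
            return None
        queries = QUERIES_WANTING_LABELS.get(page_title)
        if not queries:
            return None
        if random.random() > SAMPLING_RATE:
            return None
        # in practice we would probably ask about the query with the fewest
        # judgements collected so far; picking at random keeps the sketch short
        return (random.choice(queries), page_title)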
These labels would almost certainly be noisy; we would need to collect the same judgement many times to get any kind of certainty on the label. Additionally, we would not really be able to explain the nuances of a grading scale with many points, so we would probably have to use either a thumbs up/thumbs down approach, or maybe a happy/sad/indifferent smiley face.
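For illustration, one way we could decide when the noisy votes have converged is something like a Wilson score lower bound on the thumbs-up proportion; the vote minimum and threshold below are made-up numbers, just to show the shape of the aggregation:

    # Sketch of turning noisy thumbs up/down votes into a label, assuming we
    # only trust a judgement once its Wilson score lower bound clears a
    # threshold. min_votes and threshold are placeholder values.
    import math

    def wilson_lower_bound(up, total, z=1.96):
        """Lower bound of the 95% Wilson interval for the 'relevant' proportion."""
        if total == 0:
            return 0.0
        phat = up / total
        denom = 1 + z * z / total
        centre = phat + z * z / (2 * total)
        margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
        return (centre - margin) / denom

    def label_from_votes(up, down, min_votes=10, threshold=0.7):
        """Return 'relevant', 'not relevant', or None if we aren't sure yet."""
        total = up + down
        if total < min_votes:
            return None
        if wilson_lower_bound(up, total) >= threshold:
            return "relevant"
        if wilson_lower_bound(down, total) >= threshold:
            return "not relevant"
        return None   # keep collecting votes

Even 8 thumbs up out of 10 only gives a lower bound of about 0.49 here, which is exactly the "collect the same judgement many times" problem.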
Does this seem reasonable? Are there other ways we could go about collecting the same data? How could we design it in a non-intrusive manner that gets results but doesn't annoy users? Other thoughts?
For some background:
* We are currently generating labeled data using statistical analysis (clickmodels) against historical click data. This analysis requires multiple search sessions with the same query presented with similar results in order to estimate the relevance of those results (a toy sketch of this estimation step follows the list below). A manual review of the results showed that queries with clicks from at least 10 sessions had reasonable but not great labels, queries with 35+ sessions looked pretty good, and queries with hundreds of sessions were labeled really well.
* An analysis of 80 days worth of search click logs showed that 35 to 40% of search sessions are for queries that are repeated more than 10 times in that 80 day period. Around 20% of search sessions are for queries that are repeated more than 35 times in that period. (https://phabricator.wikimedia.org/P5371)
* Our privacy policy prevents us from keeping more than 90 days' worth of data from which to run these clickmodels. Practically, 80 days is probably a reasonable cutoff, as we will want to re-use the data multiple times before needing to delete it and generate a new set of labels.
* We currently collect human relevance judgements with Discernatron (https://discernatron.wmflabs.org/). This is useful data for manual evaluation of changes, but the data set is much too small (low hundreds of queries, with an average of 50 documents per query) to integrate into machine learning. The process of judging query/document pairs is quite tedious for the community, and it doesn't seem like a great use of engineer time for us to do this ourselves.
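For anyone on the list who hasn't seen the clickmodel side, here is a toy version of the idea (a position-based model with assumed examination probabilities, not the model we actually run); it mostly illustrates why estimates for a query only settle down once we have clicks from many sessions:

    # Toy position-based click model: relevance(page) ~ clicks / expected examinations.
    # The examination probabilities are assumed constants, purely for illustration;
    # the real pipeline fits a proper click model over the historical logs.
    from collections import defaultdict

    EXAMINATION_PROB = [0.9, 0.7, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1, 0.08, 0.05]

    def estimate_relevance(sessions):
        """sessions: list of (results, clicked_set) pairs for one query,
        where results is the ranked list of page ids shown to the user."""
        clicks = defaultdict(float)
        exams = defaultdict(float)
        for results, clicked in sessions:
            for rank, page in enumerate(results[:len(EXAMINATION_PROB)]):
                exams[page] += EXAMINATION_PROB[rank]
                if page in clicked:
                    clicks[page] += 1.0
        # With only a handful of sessions these ratios are extremely noisy,
        # which is why we need 10s to 100s of sessions per query.
        return {page: clicks[page] / exams[page] for page in exams}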
Hi Erik,
I've been using some similar methods to evaluate Related Article recommendations (https://meta.wikimedia.org/wiki/Research:Evaluating_RelatedArticles_recommendations) and the source of the trending article card (https://meta.wikimedia.org/wiki/Research:Comparing_most_read_and_trending_edits_for_Top_Articles_feature) in the Explore feed on Android. Let me know if you'd like to sit down and chat about experimental design sometime.
- J
This might be useful. I'll see if I can find a time on both our calendars. I should note, though, that this is explicitly not about experimental design. The data is not going to be used for experimental purposes, but rather to feed into a machine learning pipeline that will re-order search results to provide the best results at the top of the list. For the purpose of ensuring the long tail is represented in the training data for this model, I would like to have a few tens of thousands of labels for (query, page) combinations each month. The relevance of pages to a query does have some temporal aspect, so we would likely want to only use the last N months' worth of data (TBD).
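To sanity check whether tens of thousands of labels a month is plausible, this is the kind of back-of-envelope arithmetic I'm doing; every constant in it is a guess that would need to be replaced with measured numbers:

    # Back-of-envelope estimate of labelled (query, page) pairs per month.
    # Every constant here is a placeholder guess, not a measurement.
    targeted_pages     = 5000    # pages on the curated query list
    views_per_page_day = 2000    # "average page receives a few thousand views/day"
    sampling_rate      = 0.02    # fraction of eligible views that see a prompt
    response_rate      = 0.05    # fraction of prompted readers who answer
    votes_per_label    = 10      # noisy votes aggregated into one usable label
    days               = 30

    votes = targeted_pages * views_per_page_day * days * sampling_rate * response_rate
    labels = votes / votes_per_label
    print(f"~{votes:,.0f} raw votes -> ~{labels:,.0f} usable labels per month")

The sampling rate and the size of the curated page list are the knobs we control, so we can tune those once we know the real response rate.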