On Thu, May 4, 2017 at 11:44 AM, Jan Drewniak <jdrewniak(a)wikimedia.org>
wrote:
Hi Erik
From my understanding, it looks like you're looking to collect relevance
data "in reverse". Typically, for this type of data collection, I would
assume that you'd present a query with some search results, and ask users
"which results are relevant to this query" (which is what Discernatron
does, at a very high effort level).
Indeed, this is looking to go in reverse. The problem with asking people
performing a query if the results are any good is that the specific
queries I'm interested in are not performed by very many people. These
queries see on average less than one instance per week. By doing it in
reverse we can sample from a (hopefully) much larger distribution. I still
need to do some analysis, though, to see if these long tail queries also
return long tail pages, as in ones that only receive a few tens of hits
per day. If the result pages are also rarely viewed then this scheme will
likely not work. We do have a particularly large sample of queries (~10
million or so) to draw from, though, so we can likely find queries with
popular enough pages to get information about.
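To make that analysis concrete, something like the following sketch is
what I have in mind (every name, threshold, and data shape here is made up
for illustration):

    # Hypothetical sketch: check whether the top results for our long
    # tail queries receive enough organic pageviews for "reverse" labeling.
    LONG_TAIL_MAX_WEEKLY = 1   # queries seen about once a week or less
    MIN_DAILY_PAGEVIEWS = 50   # below this, too few visitors to ask

    def viable_queries(query_freq, top_results, daily_pageviews):
        """query_freq: {query: average weekly frequency}
        top_results: {query: [page_id, ...]}
        daily_pageviews: {page_id: average daily views}"""
        viable = {}
        for query, freq in query_freq.items():
            if freq > LONG_TAIL_MAX_WEEKLY:
                continue  # not long tail; clickmodels already cover it
            pages = [p for p in top_results.get(query, [])
                     if daily_pageviews.get(p, 0) >= MIN_DAILY_PAGEVIEWS]
            if pages:
                viable[query] = pages
        return viable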
What I think you're proposing instead is that when a user visits an
article, we present them with a question that asks "would this search
query be relevant to the article you are looking at".
I can see this working, provided that the query is controlled and the
question is *not* phrased like it is above.
I think that for this to work, the question should be phrased in a way
that elicits a simple "top-level" (maybe "yes" or "no") response. For
example, the question "*is this page about*: 'hydrostone halifax nova
scotia'" can be responded to with a thumbs up 👍 or thumbs down 👎, but a
question like "is this article relevant to the following query: ..." seems
more complicated 🤔.
Indeed, wordsmithing will be important here. I'm not sure 'is this page
about' will be quite the right question, but I'm also not sure what the
right question is. Relevance is a little more nuanced than what the page
is about; some judgement needs to be made about the intent of the query
and whether the page can satisfy that intent.
On Thu, May 4, 2017 at 6:29 PM, Erik Bernhardson <
ebernhardson(a)wikimedia.org> wrote:
On Wed, May 3, 2017 at 12:44 PM, Jonathan Morgan
<jmorgan(a)wikimedia.org>
wrote:
Hi Erik,
I've been using some similar methods to evaluate Related Article
recommendations
<https://meta.wikimedia.org/wiki/Research:Evaluating_RelatedArticles_recommendations>
and the source of the trending article card
<https://meta.wikimedia.org/wiki/Research:Comparing_most_read_and_trending_edits_for_Top_Articles_feature>
in the Explore feed on Android. Let me know if you'd like to sit down and
chat about experimental design sometime.
- J
This might be useful. I'll see if I can find a time on both our
calendars. I should note, though, that this is explicitly not about
experimental design. The data is not going to be used for experimental
purposes, but rather to feed into a machine learning pipeline that will
re-order search results to put the best results at the top of the list. To
ensure the long tail is represented in the training data for this model, I
would like to have a few tens of thousands of labels for (query, page)
combinations each month. The relevance of pages to a query does have some
temporal aspect, so we would likely want to only use the last N months'
worth of data (TBD).
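As a rough sketch of that temporal cutoff (N_MONTHS and the label record
shape are placeholders, not a real schema):

    from datetime import datetime, timedelta

    N_MONTHS = 6  # TBD

    def recent_labels(labels, now=None):
        """labels: iterable of dicts like
        {'query': ..., 'page_id': ..., 'label': ..., 'collected_at': datetime}
        Keeps only labels young enough to still reflect current relevance."""
        now = now or datetime.utcnow()
        cutoff = now - timedelta(days=30 * N_MONTHS)
        return [l for l in labels if l['collected_at'] >= cutoff]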
On Wed, May 3, 2017 at 12:24 PM, Erik Bernhardson <
ebernhardson(a)wikimedia.org> wrote:
At our weekly relevance meeting an interesting
idea came up about how
to collect relevance judgements for the long tail of queries, which make up
around 60% of search sessions.
We are pondering asking questions on the article pages themselves.
Roughly, we would manually curate some list of queries we want to collect
relevance judgements for. When a user has spent some threshold of time
(60s?) on a page we would, for some % of users, check if we have any
queries we want labeled for this page, and then ask them if the page is a
relevant result for that query. In this way the amount of work asked of
individuals is relatively low and hopefully something they can answer
without too much effort. We know that the average page receives a few
thousand page views per day, so even with a relatively low response rate
we could probably collect a reasonable number of judgements over some
medium-length time period (weeks?).
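In rough Python, the gating logic might look something like this; the
thresholds, sampling rate, and names are all invented for illustration:

    import random

    DWELL_THRESHOLD_S = 60   # time on page before we consider prompting
    SAMPLING_RATE = 0.01     # fraction of eligible views that get a prompt

    def maybe_prompt(page_id, dwell_seconds, pending_queries):
        """pending_queries: {page_id: [queries still needing judgements]}
        Returns a query to ask the reader about, or None."""
        if dwell_seconds < DWELL_THRESHOLD_S:
            return None                 # hasn't engaged with the page yet
        queries = pending_queries.get(page_id)
        if not queries:
            return None                 # nothing to label for this page
        if random.random() > SAMPLING_RATE:
            return None                 # not in the sampled % of users
        return random.choice(queries)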
These labels would almost certainly be noisy; we would need to collect
the same judgement many times to get any kind of certainty on the label.
Additionally, we would not really be able to explain the nuances of a
grading scale with many points, so we would probably have to use either a
thumbs up/thumbs down approach, or maybe a happy/sad/indifferent smiley
face.
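One plausible way to collapse the repeated noisy votes into a label is
sketched below; the minimum count and the use of a Wilson score interval
are my assumptions, not a settled design:

    import math

    MIN_JUDGEMENTS = 10   # don't trust a label until we have this many votes
    Z = 1.96              # ~95% confidence

    def aggregate(up_votes, down_votes):
        """Collapse repeated thumbs up/down into a label, or None if the
        Wilson score interval on the 'up' rate still straddles 50%."""
        n = up_votes + down_votes
        if n < MIN_JUDGEMENTS:
            return None
        p = up_votes / n
        margin = Z * math.sqrt((p * (1 - p) + Z * Z / (4 * n)) / n)
        lower = (p + Z * Z / (2 * n) - margin) / (1 + Z * Z / n)
        upper = (p + Z * Z / (2 * n) + margin) / (1 + Z * Z / n)
        if lower > 0.5:
            return 'relevant'
        if upper < 0.5:
            return 'not relevant'
        return None  # still ambiguous, keep collecting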
Does this seem reasonable? Are there other ways we could go about
collecting the same data? How do we design it in a non-intrusive manner
that gets results but doesn't annoy users? Other thoughts?
For some background:
* We are currently generating labeled data using statistical analysis
(clickmodels) against historical click data. This analysis requires there
to be multiple search sessions with the same query presented with similar
results to estimate the relevance of those results. A manual review of the
results showed queries with clicks from at least 10 sessions had reasonable
but not great labels, queries with 35+ sessions looked pretty good, and
queries with hundreds of sessions were labeled really well. (A rough
sketch of this session-count filtering follows at the end of this list.)
* An analysis of 80 days' worth of search click logs showed that 35 to
40% of search sessions are for queries that are repeated more than 10 times
in that 80-day period. Around 20% of search sessions are for queries that
are repeated more than 35 times in that 80-day period. (
https://phabricator.wikimedia.org/P5371)
* Our privacy policy prevents us from keeping more than 90 days' worth
of data from which to run these clickmodels. Practically, 80 days is
probably a reasonable cutoff, as we will want to re-use the data multiple
times before needing to delete it and generate a new set of labels.
* We currently collect human relevance judgements with Discernatron (
https://discernatron.wmflabs.org/). This is useful data for manual
evaluation of changes, but the data set is much too small (low hundreds of
queries, with an average of 50 documents per query) to integrate into
machine learning. The process of judging query/document pairs is quite
tedious for the community, and it doesn't seem like a great use of
engineer time for us to do this ourselves.
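To make the session-count thresholds from the first bullet concrete, here
is a minimal sketch of the filtering step that would precede a clickmodel
run; the log record shape is invented for illustration:

    from collections import defaultdict

    MIN_SESSIONS = 10  # below this, clickmodel labels looked unreliable

    def trainable_queries(click_log):
        """click_log: iterable of dicts like
        {'session_id': ..., 'query': ..., 'clicked_page': ...}
        Returns the set of queries with enough distinct sessions to run
        a clickmodel on."""
        sessions = defaultdict(set)
        for row in click_log:
            sessions[row['query']].add(row['session_id'])
        return {q for q, s in sessions.items() if len(s) >= MIN_SESSIONS}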
_______________________________________________
AI mailing list
AI(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ai
--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
--
Jan Drewniak
UX Engineer, Discovery
Wikimedia Foundation
_______________________________________________
discovery mailing list
discovery(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery