Re: [discovery] [AI] Collecting human labeled relevance judgements for search from readers

5 May 2017

This conversation is exactly what I meant by "experimental design" above.
I like Jan's recommendation to keep the prompt simple, and ask people to
provide a quick binary judgement. However I agree that, considering some of
the search queries you're showing folks are going to be kind of oddball,
you want to give them a little bit of context to help them understand that
they're looking at a search query.
One possible way to give people the context they need to answer the
question accurately is to provide them with, say, three of the top search
queries that you think are relevant to the result, and ask them to choose
which one is *most* relevant.
Without some context, I'm not sure I would be able to give an accurate
answer to the question "Is this article about 'hydrostone halifax nova
scotia'"?
Seeing multiple examples makes decision-making easier. The prompt could be
something like "Which set of [search terms/key words/tags] is most relevant
to this article?"
Adding a "none of the above" option as well would allow you to screen out
cases where the responder was either confused by the question, or felt that
none of the candidate queries were even remotely relevant.
I suggest you loop Aeryn Palmer from Legal in, and add a "why are we asking
this?" link into the banner/quicksurvey popup that links to a survey
privacy statement page on FoundationWiki
https://wikimediafoundation.org/wiki/Quick_Survey_Privacy_Statement.
Hope that helps,
J
On Thu, May 4, 2017 at 12:32 PM, Trey Jones tjones@wikimedia.org wrote:
...
Yeah, this is definitely the reverse of Discernatron. Part of the reason
for waiting 60s is that then, hopefully, the reader at least has some idea
what the article is about (another difficulty with Discernatron), so they
only have to spend a little time guessing what the query is about.
We are going to have to work on the wording of the question. It needs to
be clear and concise.
I worry that *Is this page about "X"?* might make people reply too
strictly. A page can be reasonably relevant to X without being *about* X.
What about this: *If you searched for X, would this article be a good
result?* I'm not sure normal people think of "results".

*Would someone who searched for X want to read this article?* - better

*If someone searched for X, would they want to read this article?* - longer,
but easier to parse.

*If someone searched for X, would they find what they are looking for in
this article?* - probably too long
More brainstorming on this wouldn't hurt, even if it is very early in the
whole process.
There's also the wording that goes with the request for a judgement. "Help
us make search better!" might get more response than just the judgement
question.
Folks in fundraising might have good ideas about how to catch people's
attention, and at the very least we could learn from them and actively
A/B test different options and see what kind of response rate we get.
We might also get cleaner A/B test results if we limited their scope—a few
pages and a few "queries" where we know the answers, so we can gauge not
only response rate, but also engagement, to see if one kind of phrasing
makes people try a little harder.
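
A minimal sketch of how such a limited-scope A/B test could be summarized, assuming each impression is logged as a (variant, answer, expected) tuple. The tuple layout, the function name, and the use of accuracy on known-answer pairs as a stand-in for "engagement" are all assumptions for illustration, not anything that exists yet.

```python
from collections import defaultdict

def summarize_variants(impressions):
    """Summarize response rate and accuracy per question wording.

    impressions: iterable of (variant, answer, expected) tuples (hypothetical
    log format). answer is True/False, or None when the reader dismissed the
    prompt. expected is the known-correct answer for that (query, page) pair.
    """
    shown = defaultdict(int)
    answered = defaultdict(int)
    correct = defaultdict(int)
    for variant, answer, expected in impressions:
        shown[variant] += 1
        if answer is None:
            continue  # dismissed: counts against response rate only
        answered[variant] += 1
        if answer == expected:
            correct[variant] += 1
    return {
        v: {
            "response_rate": answered[v] / shown[v],
            # Accuracy is undefined when nobody answered this variant.
            "accuracy": correct[v] / answered[v] if answered[v] else None,
        }
        for v in shown
    }
```

A wording that gets a high response rate but low accuracy on the known pairs would suggest people are clicking without really reading the question.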
We might also want to make "No, thanks" the default button so that it is
easier to bail than to give random input.
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Thu, May 4, 2017 at 2:44 PM, Jan Drewniak jdrewniak@wikimedia.org
wrote:
...
Hi Erik
From my understanding, it looks like you're looking to collect relevance
data "in reverse". Typically, for this type of data collection, I would
assume that you'd present a query with some search results, and ask users
"which results are relevant to this query" (which is what Discernatron
does, at a very high effort level).
What I think you're proposing instead is that when a user visits an
article, we present them with a question that asks "would this search query
be relevant to the article you are looking at".
I can see this working, provided that the query is controlled and the
question is *not* phrased like it is above.
I think that for this to work, the question should be phrased in a way
that elicits a simple "top-level" (maybe "yes" or "no") response. For
example, the question "*is this page about*: 'hydrostone halifax nova
scotia' " can be responded to with a thumbs up 👍 or thumbs down 👎, but a
question like "is this article relevant to the following query: ..." seems
more complicated 🤔 .
On Thu, May 4, 2017 at 6:29 PM, Erik Bernhardson <
ebernhardson@wikimedia.org> wrote:
...
On Wed, May 3, 2017 at 12:44 PM, Jonathan Morgan jmorgan@wikimedia.org
wrote:
...
Hi Erik,
I've been using some similar methods to evaluate Related Article
recommendations
https://meta.wikimedia.org/wiki/Research:Evaluating_RelatedArticles_recommendations
and the source of the trending article card
https://meta.wikimedia.org/wiki/Research:Comparing_most_read_and_trending_edits_for_Top_Articles_feature
in the Explore feed on Android. Let me know if you'd like to sit down and
chat about experimental design sometime.

J

This might be useful. I'll see if I can find a time on both our
calendars. I should note though this is explicitly not about experimental
design. The data is not going to be used for experimental purposes, but
rather to feed into a machine learning pipeline that will re-order search
results to provide the best results at the top of the list. For the purpose
of ensuring the long tail is represented in the training data for this
model I would like to have a few tens of thousands of labels for (query,
page) combinations each month. The relevance of pages to a query does have
some temporal aspect, so we would likely want to only use the last N months
worth of data (TBD).
On Wed, May 3, 2017 at 12:24 PM, Erik Bernhardson <
ebernhardson@wikimedia.org> wrote:
...
At our weekly relevance meeting an interesting idea came up about how
to collect relevance judgements for the long tail of queries, which make up
around 60% of search sessions.
We are pondering asking questions on the article pages themselves.
Roughly we would manually curate some list of queries we want to collect
relevance judgements for. When a user has spent some threshold of time
(60s?) on a page we would, for some % of users, check if we have any
queries we want labeled for this page, and then ask them if the page is a
relevant result for that query. In this way the amount of work asked of
individuals is relatively low and hopefully something they can answer
without too much work. We know that the average page receives a few
thousand page views per day, so even with a relatively low response rate we
could probably collect a reasonable number of judgements over some
medium-length time period (weeks?).
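
As a rough sketch only, the decision described above (dwell-time threshold, sampling some % of users, checking for curated queries on the page) might look like the following. The sample rate, the page title, the mapping structure, and the function name are hypothetical; only the 60s threshold and the example query come from this thread.

```python
import random

DWELL_THRESHOLD_S = 60   # "60s?" in the proposal; still undecided
SAMPLE_RATE = 0.01       # hypothetical: the proposal only says "some % of users"

# Hypothetical curated mapping of page title -> queries we want labeled.
QUERIES_TO_LABEL = {
    "Hydrostone": ["hydrostone halifax nova scotia"],
}

def pick_survey_query(page_title, dwell_seconds, rng=random):
    """Return a query to ask the reader about, or None to show nothing."""
    if dwell_seconds < DWELL_THRESHOLD_S:
        return None  # reader has not spent long enough on the page
    candidates = QUERIES_TO_LABEL.get(page_title)
    if not candidates:
        return None  # nothing we want labeled for this page
    if rng.random() >= SAMPLE_RATE:
        return None  # reader is not in the sampled fraction
    return rng.choice(candidates)
```

Passing `rng` in makes the sampling easy to stub out when testing; in practice the check would presumably live client-side rather than in Python.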
These labels would almost certainly be noisy, so we would need to collect
the same judgement many times to get any kind of certainty on the label.
Additionally, we would not be able to really explain the nuances of a
grading scale with many points, so we would probably have to use either a
thumbs up/thumbs down approach, or maybe a happy/sad/indifferent smiley
face.
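
One possible way to turn many noisy thumbs up/down votes into a label, sketched here with a Wilson score interval on the share of "yes" votes. This is a swapped-in standard technique, not something we have decided on; the 0.5 cut-off and the label names are assumptions.

```python
import math

def wilson_interval(yes, total, z=1.96):
    """Wilson score interval for the share of 'yes' votes (z=1.96 is ~95%)."""
    if total == 0:
        return (0.0, 1.0)  # no votes: we know nothing yet
    p = yes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def aggregate_label(yes, total, z=1.96):
    """Label one (query, page) pair, or report that more votes are needed."""
    low, high = wilson_interval(yes, total, z)
    if low > 0.5:
        return "relevant"
    if high < 0.5:
        return "not relevant"
    return "undecided"
```

With this rule 3 of 4 "yes" votes stays undecided, while 30 of 40 is enough to call the page relevant, which matches the intuition that each pair needs to be judged many times.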
Does this seem reasonable? Are there other ways we could go about
collecting the same data? How could we design it in a non-intrusive manner
that gets results, but doesn't annoy users? Other thoughts?
For some background:

We are currently generating labeled data using statistical analysis
(clickmodels) against historical click data. This analysis requires there
to be multiple search sessions with the same query presented with similar
results to estimate the relevance of those results. A manual review of the
results showed queries with clicks from at least 10 sessions had reasonable
but not great labels, queries with 35+ sessions looked pretty good, and
queries with hundreds of sessions were labeled really well.

An analysis of 80 days' worth of search click logs showed that 35 to
40% of search sessions are for queries that are repeated more than 10 times
in that 80 day period. Around 20% of search sessions are for queries that
are repeated more than 35 times in that 80 day period.
(https://phabricator.wikimedia.org/P5371)

Our privacy policy prevents us from keeping more than 90 days' worth
of data from which to run these clickmodels. Practically 80 days is
probably a reasonable cutoff, as we will want to re-use the data multiple
times before needing to delete it and generate a new set of labels.

We currently collect human relevance judgements with Discernatron
(https://discernatron.wmflabs.org/). This is useful data for manual
evaluation of changes, but the data set is much too small (low hundreds of
queries, with an average of 50 documents per query) to integrate into
machine learning. The process of judging query/document pairs for the
community is quite tedious, and it doesn't seem like a great use of
engineer time for us to do this ourselves.
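
For reference, the kind of coverage figure quoted in the background above (share of sessions whose query repeats more than N times) can be computed roughly like this. It is a simplified sketch, not the actual analysis behind P5371: it assumes one normalized query string per session and reads "repeated more than N times" as a count strictly greater than N.

```python
from collections import Counter

def session_share_above(session_queries, min_repeats):
    """Fraction of sessions whose query occurs more than `min_repeats` times.

    session_queries: list with one (already normalized) query per session;
    this one-query-per-session shape is an assumption for illustration.
    """
    if not session_queries:
        return 0.0
    counts = Counter(session_queries)
    # Every session of a sufficiently repeated query counts as covered.
    covered = sum(n for n in counts.values() if n > min_repeats)
    return covered / len(session_queries)
```

Sessions that fall below the threshold are the long tail that the clickmodels cannot label well, which is the gap the reader survey is meant to fill.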

AI mailing list
AI@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ai
--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)

discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery

--
Jan Drewniak
UX Engineer, Discovery
Wikimedia Foundation

