For a little backstory, in Discernatron multiple judges provide scores from 0 to 3 for results. Typically we only request that each query be reviewed by two judges. We would like to measure the level of disagreement between these two judges and, if it crosses some threshold, get two more scores, so we can then measure disagreement in the group of 4. Somehow, though, we need to define how to measure that level of disagreement and what the threshold for needing more scores is.
Some specialized concerns:
- It is probably important to capture not just that the users gave different values, but also how far apart they are. The difference between a 3 and a 2 is much smaller than between a 2 and a 0.
- If the users agree that 80% of the results are all 0 but disagree on the last 20%, the average disagreement is low, yet it's probably still important? It might be worthwhile to remove all the agreements about irrelevant results before calculating disagreement. Not sure...
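To make the first concern concrete, the naive version I can picture is something like the sketch below. The zero-filtering flag and the idea of a fixed cutoff are placeholders rather than a proposal:

    def disagreement(scores_a, scores_b, ignore_agreed_zeros=True):
        """Mean absolute difference between two judges' 0-3 scores for one query."""
        pairs = list(zip(scores_a, scores_b))
        if ignore_agreed_zeros:
            # Drop results both judges already agree are irrelevant (0 and 0),
            # so unanimous junk doesn't dilute the real disagreement.
            pairs = [(a, b) for a, b in pairs if (a, b) != (0, 0)]
        if not pairs:
            return 0.0
        return sum(abs(a - b) for a, b in pairs) / len(pairs)

    # e.g. request two more judges when disagreement(a, b) crosses some cutoff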
I know we have a few math nerds here on the list, so I'm hoping someone has a few ideas.
Disclaimer: I'm not a math nerd, and I don't know the history of Discernatron very well.
...but re: your second specialized concern, have you considered running some more sophisticated inter-rater reliability statistics to get a better sense of the degree of disagreement (controlling for random chance?). See for example: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/
- Jonathan
You're in the area of: https://en.wikipedia.org/wiki/Inter-rater_reliability
--justin
Thanks for the links! This is exactly what I was looking for. After reviewing some of the options, I'm going to do a first try with Krippendorff's alpha. Its ability to handle missing data from some graders, as well as being applicable down to n=2 raters, seems promising.
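Roughly, the calculation I have in mind is something like the sketch below, using the interval (squared-difference) distance between scores; missing ratings are handled simply by skipping results with fewer than two scores. This is an illustration, not the code that will actually run in Discernatron:

    from collections import defaultdict
    from itertools import permutations

    def krippendorff_alpha(ratings):
        """ratings: one dict per result, mapping judge name -> score (0-3).
        Judges who skipped a result are simply absent from its dict."""
        # Build the coincidence matrix from every result with >= 2 ratings.
        coincidences = defaultdict(float)  # (value, value) -> weighted pair count
        for unit in ratings:
            values = list(unit.values())
            if len(values) < 2:
                continue  # unpairable result, contributes nothing
            for a, b in permutations(values, 2):
                coincidences[(a, b)] += 1.0 / (len(values) - 1)
        totals = defaultdict(float)  # value -> marginal total
        for (a, _), count in coincidences.items():
            totals[a] += count
        n = sum(totals.values())  # total number of pairable values
        if n <= 1:
            return None  # not enough pairable data to say anything
        d_observed = sum(c * (a - b) ** 2
                         for (a, b), c in coincidences.items()) / n
        d_expected = sum(totals[a] * totals[b] * (a - b) ** 2
                         for a in totals for b in totals) / (n * (n - 1))
        return 1.0 if d_expected == 0 else 1.0 - d_observed / d_expected

    # e.g. krippendorff_alpha([{"a": 0, "b": 0}, {"a": 2, "b": 3}, {"a": 1}])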
To follow up a little here: I implemented Krippendorff's alpha and ran it against all the data we currently have in Discernatron. The distribution looks something like:
constraint               count
alpha >= 0.80               11
0.667 <= alpha < 0.80       18
0.500 <= alpha < 0.667      20
0.333 <= alpha < 0.500      26
0 <= alpha < 0.333          43
alpha < 0                   31
This is a much lower level of agreement than I was expecting. The literature suggests 0.80 as the cutoff for reliable data, and 0.667 as the cutoff above which you can at least draw tentative conclusions. Below 0 indicates less agreement than random chance would give, which suggests we need to re-evaluate the instructions to make them clearer (probably true).
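For reference, the bucketing itself is nothing fancy; a sketch like the following is all it takes (names are illustrative, and alphas_by_query is assumed to be a dict of per-query alpha values computed elsewhere):

    from collections import Counter

    BANDS = [(0.80, "alpha >= 0.80"),
             (0.667, "0.667 <= alpha < 0.80"),
             (0.500, "0.500 <= alpha < 0.667"),
             (0.333, "0.333 <= alpha < 0.500"),
             (0.0, "0 <= alpha < 0.333"),
             (float("-inf"), "alpha < 0")]

    def band(alpha):
        # First band whose lower cutoff this alpha reaches.
        return next(label for cutoff, label in BANDS if alpha >= cutoff)

    def distribution(alphas_by_query):
        return Counter(band(a) for a in alphas_by_query.values())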
Interesting stats, Erik. Thanks for sharing these.
More clarity in the documentation is always good.
For some of the negative alpha agreement values, a couple of possible sources come to mind. There could be bad faith actors, who either didn't really try very hard, or purposely put in incorrect or random values. There could also be genuine disagreement between the scorers about the relevance of the results—David and I discussed one that we both scored, and we disagreed like that. I can see where he was coming from, but it wasn't how I thought of it. In both of these cases, additional scores would help.
One thing I noticed that has been inconsistent in my own scoring is that early on when I got a mediocre query (i.e., I wouldn't expect any *really* good results), I tended to grade on a curve. I'd give "Relevant" to the best result even if it wasn't actually a great result. After grading a couple of queries for which there were clearly *no* good results (i.e., *everything* was irrelevant), I think I stopped grading on a curve.
My point there is that's one place we could improve the documentation: explicitly state that not every query has good results. It's okay to not have any result rated as "relevant"—or this could already be in the docs, and the problem is that no one reads them. :(
Another thing Erik has suggested is trying to filter out wildly non-encyclopedic queries (like "SANTA CLAUS PRINT OUT 3D PAPERTOYS"), and maybe really vague queries (like "antonio parent"), but that's potentially more work than filtering PII, and much more subjective.
It might also be informative to review some of the scores for the negative alphas and see if something obvious is going on, in which case we'd know the alpha calculation is doing its job.
Did you add any honey-pot answers? That is, answers where you know the results quite well (via many judges agreeing), or where they are very obvious (q=Obama, results=[en:Presidency of Barack Obama, en:A4 Paper]).
I've set these up as a pre-test before starting the judgment session, to check that the judge understands the instructions, and also randomly included them during the session to weed out judges who select answers at random.
Investigating the labels (individual query-result pairs) with the most disagreement may be useful, along with the judges with the most disagreement.
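A rough sketch of that kind of screening, with made-up names and tolerance, compares a judge's answers on the known items against the expected scores and drops judges who miss too many:

    def passes_honeypots(judge_scores, expected, max_misses=1, tolerance=1):
        """judge_scores and expected both map a result id to a score in 0..3."""
        misses = sum(
            1
            for item, want in expected.items()
            if item in judge_scores and abs(judge_scores[item] - want) > tolerance
        )
        return misses <= max_misses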
--justin
On Mon, Oct 31, 2016 at 1:13 PM, Justin Ormont justin.ormont@gmail.com wrote:
> Did you add any honey-pot answers? That is, answers where you know the results quite well (via many judges agreeing), or where they are very obvious (q=Obama, results=[en:Presidency of Barack Obama, en:A4 Paper]).
> I've set these up as a pre-test before starting the judgment session, to check that the judge understands the instructions, and also randomly included them during the session to weed out judges who select answers at random.
We don't have any honey-pot answers yet. We had thought about it, but hoped that since there is no real benefit to users in doing a bad job (no payments, no leaderboard to get on) it wouldn't be necessary. We may have to re-evaluate that, though; it seems a common way to deal with crowd-sourced data.
> Investigating the labels (individual query-result pairs) with the most disagreement may be useful, along with the judges with the most disagreement.
Good idea, will be looking into it soon.
On Mon, Oct 31, 2016 at 7:43 AM, Trey Jones tjones@wikimedia.org wrote:
> Interesting stats, Erik. Thanks for sharing these.
> More clarity in the documentation is always good.
> For some of the negative alpha agreement values, a couple of possible sources come to mind. There could be bad faith actors, who either didn't really try very hard, or purposely put in incorrect or random values. There could also be genuine disagreement between the scorers about the relevance of the results—David and I discussed one that we both scored, and we disagreed like that. I can see where he was coming from, but it wasn't how I thought of it. In both of these cases, additional scores would help.
> One thing I noticed that has been inconsistent in my own scoring is that early on when I got a mediocre query (i.e., I wouldn't expect any *really* good results), I tended to grade on a curve. I'd give "Relevant" to the best result even if it wasn't actually a great result. After grading a couple of queries for which there were clearly *no* good results (i.e., *everything* was irrelevant), I think I stopped grading on a curve.
I think I've done much the same. I reviewed a few of the low-reliability queries that I have evaluated, and for at least a few I wouldn't mind going back and updating my scores to be more in line with the other reviewer. In some cases, perhaps earlier in my usage of the platform, I was being more generous with the relevancy level of items that didn't have any particularly good answers.
> My point there is that's one place we could improve the documentation: explicitly state that not every query has good results. It's okay to not have any result rated as "relevant"—or this could already be in the docs, and the problem is that no one reads them. :(
I'll update the documentation, although I do wonder if users really read that page beyond the first time using the system.
> Another thing Erik has suggested is trying to filter out wildly non-encyclopedic queries (like "SANTA CLAUS PRINT OUT 3D PAPERTOYS"), and maybe really vague queries (like "antonio parent"), but that's potentially more work than filtering PII, and much more subjective.
> It might also be informative to review some of the scores for the negative alphas and see if something obvious is going on, in which case we'd know the alpha calculation is doing its job.
I looked at a few with negative alpha, and the calculation is certainly doing its job. In most cases there is an obvious disagreement about what constitutes relevance for the query. In some cases this may be a difference in interpretation of the query intent (always difficult), or one reviewer being more generous in declaring things partially relevant while the other reviewer gave a fairly strict irrelevant classification. There were a couple that were a bit surprising: for example, if there are 7 results to grade and one grader gives them all 0 while the other grader gives five 0's, one 1, and one 2, that has a negative alpha. That does seem like the right behavior, though.
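For reference, here is a quick arithmetic check of that example, assuming the interval (squared-difference) distance and alpha = 1 - D_o / D_e over the coincidence matrix:

    pairs = [(0, 0)] * 5 + [(0, 1), (0, 2)]  # (grader A, grader B) per result
    n = 2 * len(pairs)                       # 14 pairable values
    d_obs = sum(2 * (a - b) ** 2 for a, b in pairs) / n           # 10/14, ~0.71
    counts = {0: 12, 1: 1, 2: 1}             # how often each score appears overall
    d_exp = sum(counts[c] * counts[k] * (c - k) ** 2
                for c in counts for k in counts) / (n * (n - 1))  # 122/182, ~0.67
    print(1 - d_obs / d_exp)                 # ~ -0.07: negative, as observed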