Disclaimer: I'm not a math nerd, and I don't know the history of Discernatron very well. 

...but re: your second specialized concern, have you considered running some more sophisticated inter-rater reliability statistics to get a better sense of the degree of disagreement while controlling for chance agreement? See, for example: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/
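For ordinal 0-3 scores like Discernatron's, one concrete chance-corrected option would be a weighted Cohen's kappa. A minimal sketch, assuming scikit-learn is available (the judge_a/judge_b score lists are made up purely for illustration):

from sklearn.metrics import cohen_kappa_score

# Made-up 0-3 ratings from two judges over the same set of results.
judge_a = [0, 0, 3, 2, 1, 0, 2]
judge_b = [0, 1, 2, 2, 0, 0, 3]

# Quadratic weights penalize a 0-vs-3 disagreement far more heavily than
# a 2-vs-3 one, so "how far apart the scores are" is built into the statistic.
kappa = cohen_kappa_score(judge_a, judge_b, weights="quadratic")
print(f"quadratically weighted kappa: {kappa:.2f}")

Kappa runs from roughly -1 to 1: values near 1 indicate strong agreement beyond chance, while values near 0 indicate agreement no better than chance alone.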

- Jonathan

On Wed, Oct 26, 2016 at 11:21 AM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
For a little backstory, in Discernatron multiple judges provide scores from 0 to 3 for results. Typically we request that each query be reviewed by only two judges. We would like to measure the level of disagreement between these two judges and, if it crosses some threshold, get two more scores, so we can then measure disagreement in the group of four. Somehow, though, we need to define both how to measure that level of disagreement and what the threshold for requesting more scores should be.
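Roughly, the shape of what we need looks something like the sketch below; the mean-absolute-difference measure, the 0.5 cutoff, and the function names are just illustrative placeholders, not decisions:

# Placeholder disagreement measure: average absolute difference between
# the two judges' 0-3 scores for a query's results.
def mean_abs_disagreement(scores_a, scores_b):
    return sum(abs(a - b) for a, b in zip(scores_a, scores_b)) / len(scores_a)

# Placeholder threshold; the real value is exactly what we need to define.
DISAGREEMENT_THRESHOLD = 0.5

def needs_more_judges(scores_a, scores_b):
    # True when the query should be sent to two more judges.
    return mean_abs_disagreement(scores_a, scores_b) > DISAGREEMENT_THRESHOLD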

Some specialized concerns:
* It is probably important to capture not just whether the judges gave different values, but also how far apart those values are. The difference between a 3 and a 2 is much smaller than the difference between a 2 and a 0.
* If the judges agree that 80% of the results are 0 but disagree on the remaining 20%, the average disagreement is low, yet that disagreement probably still matters. Might be worthwhile to remove all the agreed-on irrelevant results before calculating disagreement (see the sketch below)? Not sure...
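One possible way to combine both concerns, again only as a sketch (the filtered_disagreement name is made up, and whether dropping the mutual-zero results is actually the right call is exactly the open question):

# Ignore results both judges scored 0, then compute a squared-difference
# disagreement over what remains, normalized so the worst case (0 vs 3
# on every remaining result) comes out to 1.0.
def filtered_disagreement(scores_a, scores_b):
    pairs = [(a, b) for a, b in zip(scores_a, scores_b)
             if not (a == 0 and b == 0)]
    if not pairs:
        return 0.0  # the judges agreed everything was irrelevant
    return sum((a - b) ** 2 for a, b in pairs) / (len(pairs) * 9.0)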

I know we have a few math nerds here on the list, so I'm hoping someone has a few ideas.

_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery




--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation