On Mon, Oct 31, 2016 at 1:13 PM, Justin Ormont <justin.ormont@gmail.com> wrote:
Did you add any honey-pot answers? Answers where you know the results quite well (via many judges agreeing), or are very obvious (q=Obama, results=[en:Presidency of Barack Obama, en:A4 Paper]).

I've set these up as a pre-test before starting the judgment session to check that the judge understands the instructions, and also included them at random during the session to weed out judges who select answers randomly.

We don't have any honey-pot answers yet. We had thought about it, but had hoped that since there was no real benefit to users of doing a bad job (no payments, no leaderboard to get on), it wouldn't be necessary. We may have to re-evaluate that, though; it seems to be a common way to deal with crowd-sourced data.
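
For illustration only, the honey-pot check described above could look something like the sketch below. Nothing here is existing Discernatron code: the function name, the (judge, item, label) tuple layout, the `gold` mapping of known answers, and the 0.8 accuracy threshold are all assumptions.

```python
def judges_failing_honeypots(judgments, gold, min_accuracy=0.8):
    """Return {judge: accuracy} for judges who score below min_accuracy
    on honey-pot items.

    judgments: iterable of (judge, item, label) tuples (hypothetical layout).
    gold: dict mapping honey-pot item -> expected label (hypothetical).
    """
    seen = {}     # honey-pot items each judge has scored
    correct = {}  # how many of those matched the expected label
    for judge, item, label in judgments:
        if item not in gold:
            continue  # ordinary item, not a honey-pot
        seen[judge] = seen.get(judge, 0) + 1
        if label == gold[item]:
            correct[judge] = correct.get(judge, 0) + 1

    failing = {}
    for judge, count in seen.items():
        accuracy = correct.get(judge, 0) / count
        if accuracy < min_accuracy:
            failing[judge] = accuracy
    return failing
```

Judges who never saw a honey-pot item are not reported at all, so the check only works if those items are mixed in often enough.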
 
Investigating the labels (individual query-result pair) with the most disagreement may be useful, along with the judges with the most disagreement.

Good idea, will be looking into it soon. 
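
As a rough sketch of what that investigation might look like: the snippet below ranks query-result pairs by how split the judges were, and ranks judges by how often they differ from the most common label. The function name, the tuple layout, and the "most common label" heuristic are assumptions made for illustration, not anything already in place.

```python
from collections import Counter, defaultdict


def rank_disagreement(judgments):
    """Rank items and judges by disagreement, most disagreement first.

    judgments: iterable of (judge, item, label) tuples (hypothetical layout),
    where item identifies one query-result pair.
    Returns (items, judges), each a list of (key, score) pairs.
    """
    by_item = defaultdict(list)
    for judge, item, label in judgments:
        by_item[item].append((judge, label))

    item_scores = {}
    misses = Counter()
    totals = Counter()
    for item, votes in by_item.items():
        if len(votes) < 2:
            continue  # a single score cannot disagree with anything
        counts = Counter(label for _judge, label in votes)
        # On a tie, whichever tied label was seen first is treated as "top".
        top_label, top_count = counts.most_common(1)[0]
        # Share of scores that differ from the most common label.
        item_scores[item] = 1.0 - top_count / len(votes)
        for judge, label in votes:
            totals[judge] += 1
            if label != top_label:
                misses[judge] += 1

    judge_scores = {judge: misses[judge] / totals[judge] for judge in totals}
    items = sorted(item_scores.items(), key=lambda kv: kv[1], reverse=True)
    judges = sorted(judge_scores.items(), key=lambda kv: kv[1], reverse=True)
    return items, judges
```

With only two or three scores per pair, a high judge score may just reflect genuine disagreement rather than careless scoring, so the ranking is a starting point for manual review.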

--justin

On Mon, Oct 31, 2016 at 7:43 AM, Trey Jones <tjones@wikimedia.org> wrote:
Interesting stats, Erik. Thanks for sharing these.

More clarity in the documentation is always good.

For some of the negative alpha agreement values, a couple of possible sources come to mind. There could be bad faith actors, who either didn't really try very hard, or purposely put in incorrect or random values. There could also be genuine disagreement between the scorers about the relevance of the results—David and I discussed one that we both scored, and we disagreed like that. I can see where he was coming from, but it wasn't how I thought of it. In both of these cases, additional scores would help.

One thing I noticed that has been inconsistent in my own scoring is that early on when I got a mediocre query (i.e., I wouldn't expect any really good results), I tended to grade on a curve. I'd give "Relevant" to the best result even if it wasn't actually a great result. After grading a couple of queries for which there were clearly no good results (i.e., everything was irrelevant), I think I stopped grading on a curve.

My point there is that's one place we could improve the documentation: explicitly state that not every query has good results. It's okay to not have any result rated as "relevant"—or this could already be in the docs, and the problem is that no one reads them. :(

Another thing that Erik has suggested was trying to filter out wildly non-encyclopedic queries (like "SANTA CLAUS PRINT OUT 3D PAPERTOYS"), and maybe really vague queries (like "antonio parent"), but that's potentially more work than filtering PII, and much more subjective.

It might also be informative to review some of the scores for the negative alphas and see if something obvious is going on, in which case we'd know the alpha calculation is doing its job.


On Thu, Oct 27, 2016 at 7:21 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
To follow up a little here: I implemented Krippendorff's Alpha and ran it against all the data we currently have in Discernatron. The distribution looks something like:

constraint                count
alpha >= 0.80                11
0.667 <= alpha < 0.80        18
0.500 <= alpha < 0.667       20
0.333 <= alpha < 0.500       26
0 <= alpha < 0.333           43
alpha < 0                    31

This is a much lower level of agreement than I was expecting. The literature suggests 0.80 as a reliable cutoff, and 0.667 as a cutoff from which you can draw tentative conclusions. A value below 0 indicates less agreement than would be expected by chance, which suggests we need to make the instructions clearer (probably true).
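
For readers unfamiliar with the statistic, here is a minimal sketch of Krippendorff's Alpha. It is not the implementation that produced the numbers above. It assumes the nominal distance metric (any two different labels count as equally far apart), which ignores the ordering of relevance grades, and the function name and input layout are made up for the example.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha with the nominal metric.

    units: iterable of lists; each list holds the labels that different
    judges gave to one item (hypothetical layout). Items with fewer than
    two labels are skipped. Returns None when alpha is undefined.
    """
    # Coincidence matrix: every ordered pair of labels from two different
    # judges on the same item contributes 1 / (m - 1), m = labels on the item.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        return None

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return None  # only one label was ever used, so alpha is undefined
    return 1.0 - observed / expected
```

Two judges who agree on every item give an alpha of 1, while two judges who systematically pick opposite labels give a negative value, which is the "less agreement than chance" case.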



_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery


