On Mon, Oct 31, 2016 at 7:43 AM, Trey Jones <tjones(a)wikimedia.org> wrote:
Interesting stats, Erik. Thanks for sharing these.
More clarity in the documentation is always good.
For some of the negative alpha agreement values, a couple of possible
sources come to mind. There could be bad faith actors, who either didn't
really try very hard, or purposely put in incorrect or random values. There
could also be genuine disagreement between the scorers about the relevance
of the results—David and I discussed one that we both scored, and we
disagreed like that. I can see where he was coming from, but it wasn't how
I thought of it. In both of these cases, additional scores would help.
One thing I noticed that has been inconsistent in my own scoring is that
early on when I got a mediocre query (i.e., I wouldn't expect any *really*
good results), I tended to grade on a curve. I'd give "Relevant" to the
best result even if it wasn't actually a great result. After grading a
couple of queries for which there were clearly *no* good results (i.e.,
*everything* was irrelevant), I think I stopped grading on a curve.
I think I've done much the same. I reviewed a few of the low-reliability
queries that I have evaluated, and for at least a few I wouldn't mind going
back and updating my scores to be more in line with the other reviewer. In
some cases, perhaps earlier in my usage of the platform, I was being more
generous with the relevance level of results for queries that didn't have
any particularly good answers.
My point there is that's one place we could improve the documentation:
explicitly state that not every query has good results. It's okay to not
have any result rated as "relevant". Or this could already be in the docs,
and the problem is that no one reads them. :(
I'll update the documentation, although I do wonder if users really read
that page beyond the first time using the system.
Another thing that Erik has suggested was trying to filter out wildly
non-encyclopedic queries (like "SANTA CLAUS PRINT OUT 3D PAPERTOYS"), and
maybe really vague queries (like "antonio parent"), but that's potentially
more work than filtering PII, and much more subjective.
It might also be informative to review some of the scores for the negative
alphas and see if something obvious is going on, in which case we'd know
the alpha calculation is doing its job.
I looked at a few with negative alpha, and the calculation is certainly
doing its job. In most cases there is an obvious disagreement about what
constitutes relevance for the query. In some cases this may be a difference
in interpretation of the query intent (always difficult), or one reviewer
being more generous in declaring things partially relevant while the other
reviewer gives a fairly strict irrelevant classification. There were a
couple that were a bit surprising. For example, if there are 7 results to
grade and one grader gives them all 0, while the other grader gives five
0's, one 1 and one 2, that has a negative alpha. That does seem like the
right approach, though.
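
The 7-result example can be reproduced with a minimal sketch of
Krippendorff's alpha. This is not the code that was run against
discernatron; the function names and the choice of distance metric are
assumptions made for illustration (the thread does not say whether a
nominal, ordinal, or interval metric was used), so both a nominal and an
interval distance are shown.

```python
import math
from itertools import permutations


def nominal(a, b):
    """Nominal distance: any two different grades disagree equally."""
    return 0.0 if a == b else 1.0


def interval(a, b):
    """Interval distance: squared difference between numeric grades."""
    return float((a - b) ** 2)


def krippendorff_alpha(units, distance=nominal):
    """Krippendorff's alpha for a list of units.

    Each unit is the list of grades that one result received (missing
    grades are simply left out). Returns NaN when alpha is undefined.
    """
    # Units with fewer than two grades carry no agreement information.
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)
    if n < 2:
        return math.nan

    # Observed disagreement: ordered pairs of grades within each unit.
    observed = sum(
        sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: ordered pairs over all grades pooled together.
    pooled = [v for u in units for v in u]
    expected = sum(
        distance(a, b) for a, b in permutations(pooled, 2)
    ) / (n * (n - 1))

    if expected == 0:
        return math.nan  # every grade is identical, alpha is undefined
    return 1.0 - observed / expected


# The example from the text: 7 results, two graders.
grader_a = [0, 0, 0, 0, 0, 0, 0]
grader_b = [0, 0, 0, 0, 0, 1, 2]
units = [list(pair) for pair in zip(grader_a, grader_b)]

alpha_nominal = krippendorff_alpha(units, nominal)
alpha_interval = krippendorff_alpha(units, interval)
print(alpha_nominal, alpha_interval)
```

Under these assumptions alpha comes out at about -0.04 with the nominal
distance and about -0.066 with the interval distance. Because 12 of the 14
grades are 0, the disagreement expected by chance is small, and the two
results the graders differ on are slightly more than chance would predict.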
On Thu, Oct 27, 2016 at 7:21 PM, Erik Bernhardson <
ebernhardson(a)wikimedia.org> wrote:
To follow up a little here, I implemented Krippendorff's Alpha and ran it
against all the data we currently have in discernatron. The distribution
looks something like:
constraint                  count
alpha >= 0.80                  11
0.667 <= alpha < 0.80          18
0.500 <= alpha < 0.667         20
0.333 <= alpha < 0.500         26
0 <= alpha < 0.333             43
alpha < 0                      31
This is a much lower level of agreement than I was expecting. The
literature suggests 0.80 as a reliable cutoff, and 0.667 as a cutoff from
which you can draw tentative conclusions. Below 0 indicates there is less
agreement than random chance, which suggests we need to revise the
instructions to make them clearer (probably true).
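
To put the distribution in proportion, the counts from the table can be
summed and compared against the two cutoffs. This is only a small sketch:
the counts are copied from the table, while the dictionary keys and the
`share` helper are names made up for illustration.

```python
# Counts copied from the distribution table; keys mirror its constraints.
counts = {
    "alpha >= 0.80": 11,
    "0.667 <= alpha < 0.80": 18,
    "0.500 <= alpha < 0.667": 20,
    "0.333 <= alpha < 0.500": 26,
    "0 <= alpha < 0.333": 43,
    "alpha < 0": 31,
}

total = sum(counts.values())


def share(*bands):
    """Fraction of all scored queries that fall in the given bands."""
    return sum(counts[band] for band in bands) / total


reliable = share("alpha >= 0.80")
tentative_or_better = share("alpha >= 0.80", "0.667 <= alpha < 0.80")
below_chance = share("alpha < 0")

print(f"total queries: {total}")
print(f"alpha >= 0.80:  {reliable:.1%}")
print(f"alpha >= 0.667: {tentative_or_better:.1%}")
print(f"alpha < 0:      {below_chance:.1%}")
```

With these counts there are 149 queries in total, of which roughly a fifth
reach the 0.667 cutoff and roughly a fifth fall below 0.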
_______________________________________________
discovery mailing list
discovery(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery