More randomly ordered thoughts on this. I'm sure they altogether amount to a lot of work, and I don't expect your team to do it all; I just offer them as a list of things that might be interesting for you guys to look at.
Earlier, you said you didn't have the resources to conduct a study with human subjects. I just wanted to point out again that you may be overestimating the time it takes to put one together. Building a web site where volunteers can evaluate the results of your algorithm on a page they know well would probably take a fraction of the time it took your team to develop the algorithm itself (I'm sure dealing with that much data, and figuring out who wrote which contiguous parts of text, took a lot of tinkering). I'm sure it would not be hard to convince editors and reviewers on Wikipedia to volunteer to review 30-50 pages through such a site. You could set up the experiment so that each reviewer reviews a page WITHOUT any of your colourings, and then you compute the overlap between their changes and the segments your system thought were untrustworthy. Done that way, you avoid the issue of favourable or unfavourable evaluator bias towards the system (because the evaluator does not know which segments the system deems unreliable). You would also catch both false positives and false negatives (whereas the way I evaluated the system, I could only catch false positives).
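To make the overlap idea concrete, here is a minimal sketch in Python. Everything in it is an assumption on my part: that a page can be tokenized into word positions, that your system can emit the set of positions it colours as untrustworthy, and that diffing the reviewer's revision against the original gives the set of positions they touched.

```python
def overlap_metrics(flagged_positions, edited_positions):
    """Precision/recall of flagged tokens against tokens the reviewer changed."""
    flagged = set(flagged_positions)
    edited = set(edited_positions)
    true_positives = flagged & edited  # flagged AND actually edited by the reviewer
    precision = len(true_positives) / len(flagged) if flagged else 0.0
    recall = len(true_positives) / len(edited) if edited else 0.0
    return {
        "precision": precision,                    # fraction of flagged tokens the reviewer changed
        "recall": recall,                          # fraction of changed tokens that were flagged
        "false_positives": len(flagged - edited),  # flagged but left alone
        "false_negatives": len(edited - flagged),  # edited but never flagged
    }

# Toy example: tokens 10-19 flagged, reviewer edited tokens 15-24.
print(overlap_metrics(range(10, 20), range(15, 25)))
# -> precision 0.5, recall 0.5, 5 false positives, 5 false negatives
```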
Another thought is that maybe you should not evaluate the system's ability to rate the trustworthiness of **segments**, but rather the trustworthiness of whole pages. In other words, if you focus the user's attention on pages that have a large proportion of red in them, you might get very few false positives on that task (of course you might also get lots of false negatives, but that's still better than what we have now, which is NOTHING). For a task like that, you would of course have to compare your system to a naïve baseline which uses, for example, a page's "age" (i.e. elapsed time since initial creation), the number of edits by different people, or the number of visits by different people as an indication of trustworthiness. Have you looked at how your measure correlates with the review board's evaluation of page quality?
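Here is the kind of comparison I have in mind, again just a sketch. I'm assuming you can compute, per page, the fraction of text your system colours red, plus the naïve baselines I mentioned (age, number of distinct editors), and that each page has some quality rating (e.g. a review-board grade mapped to an integer); the field names are invented.

```python
from scipy.stats import spearmanr

def correlate_with_quality(pages):
    """pages: list of dicts with hypothetical keys 'red_fraction', 'age_days',
    'n_distinct_editors' and 'quality_rating'."""
    quality = [p["quality_rating"] for p in pages]
    for feature in ("red_fraction", "age_days", "n_distinct_editors"):
        values = [p[feature] for p in pages]
        rho, p_value = spearmanr(values, quality)  # rank correlation with quality
        print(f"{feature:20s} rho={rho:+.3f}  p={p_value:.3g}")
```

If red_fraction predicts quality no better than the naïve baselines do, that's worth knowing in itself.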
Earlier, you also said you didn't think the algorithm could do a good job of predicting which parts of the text are questionable, because so many good contributions come from occasional one-off contributors and anonymous authors. Maybe all that means is that you need to set your colouring threshold higher. In other words, only colour those parts written by people who are KNOWN to be poor contributors. Also, for anonymous contributors, do you treat all of them as one big "user", or do you try to distinguish them by IP address? Have you tried eliminating anonymous contributions from your processing altogether? Have you tried eliminating contributors who only made contributions to < N pages? How do these things affect the values of the "internal" metrics you mentioned in your previous email?
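Something like the following is what I mean by raising the threshold and filtering, purely as a sketch; the segment fields (author reputation, anonymity flag, number of pages the author has edited) are all assumptions about what your pipeline could expose.

```python
def segments_to_colour(segments, reputation_threshold=0.2,
                       drop_anonymous=False, min_pages_edited=1):
    """Return only the segments that should be coloured as untrustworthy."""
    flagged = []
    for seg in segments:
        if drop_anonymous and seg["author_is_anonymous"]:
            continue  # ignore anonymous contributions entirely
        if seg["author_pages_edited"] < min_pages_edited:
            continue  # ignore one-off contributors with too little history
        if seg["author_reputation"] < reputation_threshold:
            flagged.append(seg)  # only colour authors KNOWN to be poor contributors
    return flagged
```

Sweeping reputation_threshold, drop_anonymous and min_pages_edited, and recomputing your internal metrics at each setting, would show how sensitive the results are to these choices.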
Finally, it may be that this tool is more useful for reviewers and editors than for readers of Wikipedia. So, what would be good metrics for reviewers?

* Precision/recall of pages that are low quality (see the sketch after this list).
* Precision/recall of the low-quality segments within those pages.
* Productivity boost when reviewing pages with this system vs. without. For example, does a reviewer using this system end up doing more edits per hour than a reviewer who does not?
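For the first bullet, here is a sketch of precision/recall at the top of the ranking, which is what matters if a reviewer only has time for the k most suspicious pages. The inputs are hypothetical: page ids ranked by your page-level score, plus a ground-truth set of low-quality pages.

```python
def precision_recall_at_k(ranked_pages, low_quality_pages, k):
    """Precision/recall when a reviewer only inspects the top-k ranked pages."""
    low_quality = set(low_quality_pages)
    top_k = ranked_pages[:k]
    hits = sum(1 for page in top_k if page in low_quality)
    precision = hits / k if k else 0.0
    recall = hits / len(low_quality) if low_quality else 0.0
    return precision, recall
```

The segment-level bullet is the same computation over segment ids within those pages, and the productivity bullet is just bookkeeping: edits per reviewer-hour with the tool vs. without, over comparable batches of pages.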
That's it for now. Like I said, I'm sure these altogether amount to a lot of work, and I don't expect your team to do it all; it's just a list of things that might be interesting for you guys to look at.
Cheers,
Alain