I understand that your time is limited and that an evaluation
with human subjects may not be what you want to do.
I unfortunately do not have time to do such a study either, as I
am quite overcommitted myself with 6 projects on the go at once.
Cheers,
Alain
From: luca.de.alfaro@gmail.com
[mailto:luca.de.alfaro@gmail.com] On Behalf Of Luca de Alfaro
Sent: December 21, 2007 3:12 PM
To: Desilets, Alain
Cc: Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] Wikipedia colored according to trust
Dear Alain,
I would like to encourage you to do such a study. We can also provide
data for you.
In a sense, an independent study would be even better.
The project currently has me (30% time; I also do other research, teach, and serve on
university committees), Ian (full time, except that he is taking classes, as he should),
and Bo (20% time, as he is also working on other things).
We have to be very careful in prioritizing things to do.
Also, we now have a tiny bit of funding, and while this enables me to fund
students and pay for machines, it also means that I cannot do a user study on the
spur of the moment -- I need the approval of my university's Ethics Board, and to get
that I need to apply, talk to them, etc. -- it all takes time.
Finally, my experience is that a user study is hardly ever simple. First
you start, then you realize that you are asking the wrong questions, so you
redo it; then you figure there is too much noise in the data, so you
redo it again; then you realize the data analysis does not quite work because what
you really needed was other data... and the data analysis itself is also not
simple: how to sample pages, how to sample text...
But as I say, I think it would be great if you did it, and we could provide you
the data you need.
Luca
On Dec 21, 2007 3:04 AM, Desilets, Alain <Alain.Desilets@nrc-cnrc.gc.ca>
wrote:
More randomly ordered thoughts on this. I'm sure they
altogether amount to a lot of work, and I don't expect your team to do it all.
I just offer them as a list of things that might be interesting for you guys to
look at.
Earlier, you said you didn't have the resources to conduct a study with human
subjects. I just wanted to point out again that you may be overestimating the
time it takes to put one of those together. Putting together a web site where
people can volunteer to evaluate the results of your algorithm on a page
they know well would probably require a fraction of the time it took your team
to develop the algorithm itself (I'm sure dealing with that much data, and
figuring out who wrote which contiguous parts of the text, took a lot of tinkering).
I'm sure it would not be hard to convince editors and reviewers on Wikipedia to
volunteer to review 30-50 pages through such a special site. You could set up
the experiment so that the reviewer reviews a page WITHOUT any of your
colourings, and then you compute the overlap between their changes and the
segments that your system thought were untrustworthy. By doing it that way, you
would avoid the issue of favourable or unfavourable evaluator bias
towards the system (because the evaluator does not know which segments the
system deems unreliable). You would also catch both false positives and
false negatives (whereas the way I evaluated the system, I could only catch
false positives).
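To make the overlap idea concrete, here is a rough sketch (Python, with made-up
inputs) of the comparison I have in mind, assuming both the reviewer's changes
and your colouring can be reduced to sets of word positions in the same revision
of the page:

    # Rough sketch, not your actual pipeline: compare the words the reviewer
    # changed against the words the system coloured as untrustworthy. Both
    # inputs are assumed to be sets of word indices in the same revision.
    def overlap_metrics(flagged_words, edited_words):
        """Return precision and recall of the flagged words w.r.t. the reviewer's edits."""
        hits = flagged_words & edited_words
        precision = len(hits) / len(flagged_words) if flagged_words else 0.0
        recall = len(hits) / len(edited_words) if edited_words else 0.0
        return precision, recall

    # Toy example: system coloured words 10-19, reviewer edited words 15-24.
    p, r = overlap_metrics(set(range(10, 20)), set(range(15, 25)))
    print("precision=%.2f recall=%.2f" % (p, r))  # precision=0.50 recall=0.50

A low recall in this comparison would expose the false negatives (questionable
text the system missed) that my own informal evaluation could not catch.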
Another thought is that maybe you should not evaluate the system's ability to
rate the trustworthiness of **segments**, but rather its ability to rate the
trustworthiness of whole pages. In other words, it could be that if you focus the
user's attention on pages that have a large proportion of red in them, you would
have very few false positives on that task (of course you might have lots of
false negatives too, but that's still better than what we have now, which is
NOTHING). For a task like that, you would of course have to compare your system
to a naïve implementation which, for example, uses a page's "age" (i.e.
elapsed time since initial creation), the number of edits by different
people, or the number of visits by different people as an indication of
trustworthiness. Have you looked at how your measure correlates with the review
board's evaluation of page quality?
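As a rough illustration of that comparison (the numbers and field names below
are made up, purely to show the shape of the analysis), one could rank pages
both by their fraction of red text and by a naïve baseline such as page age,
and see which ranking correlates better with an external quality rating:

    # Rough sketch with invented toy numbers: correlate a trust-based page
    # score and a naive baseline (page age) with an external quality rating,
    # e.g. a review board's assessment on a 1-5 scale.
    from scipy.stats import spearmanr

    # (fraction of page coloured red, page age in days, quality rating 1-5)
    pages = [(0.02, 1500, 5), (0.10, 400, 4), (0.35, 60, 2),
             (0.50, 10, 1), (0.05, 900, 4)]
    redness = [p[0] for p in pages]
    age = [p[1] for p in pages]
    quality = [p[2] for p in pages]

    print("redness vs quality:", spearmanr(redness, quality).correlation)
    print("age     vs quality:", spearmanr(age, quality).correlation)

If the redness-based ranking does not beat the naïve baselines, that would be
important to know before investing in segment-level evaluation.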
Earlier, you also said you didn't think the algorithm could do a good job of
predicting which parts of the text are questionable, because so many good
contributions are made by occasional one-off contributors and anonymous
authors. Maybe all this means is that you need to set your threshold for
colouring at a higher value; in other words, only colour those parts which have
been written by people who are KNOWN to be poor contributors. Also, for
anonymous contributors, do you treat all of them as one big "user",
or do you try to distinguish them by IP address? Have you tried eliminating
anonymous contributions from your processing altogether? Have you tried
eliminating contributors who made contributions to fewer than N pages? How do
these things affect the values of the "internal" metrics you mentioned in
your previous email?
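If it helps to picture the filtering, here is a small sketch of the kind of
pre-processing I mean (the revision format is made up; substitute whatever your
pipeline actually uses):

    # Small sketch of the two filters above, over a made-up revision format:
    # each revision is a dict with the page title, the author (a registered
    # user name or an IP string), and a flag marking anonymous edits.
    from collections import defaultdict

    def filter_revisions(revisions, drop_anonymous=True, min_pages=2):
        """Drop anonymous edits and edits by authors who touched fewer than
        min_pages distinct pages; return the remaining revisions."""
        pages_per_author = defaultdict(set)
        for rev in revisions:
            pages_per_author[rev["author"]].add(rev["page"])
        return [rev for rev in revisions
                if not (drop_anonymous and rev["anonymous"])
                and len(pages_per_author[rev["author"]]) >= min_pages]

Rerunning your internal metrics on the filtered history, for a few values of
min_pages, would show how sensitive the results are to these choices.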
Finally, it may be that this tool is more useful for reviewers and editors than
for readers of Wikipedia. So, what would be good metrics for reviewers?
* Precision/recall of pages that are low quality (see the sketch after this list).
* Precision/recall of segments, within those low-quality pages, that are themselves low quality.
* Productivity boost when reviewing pages with this system vs. without it. For
example, does a reviewer using this system end up doing more edits per hour
than a reviewer who does not?
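For the first metric, the computation could be as simple as this sketch (the
threshold and data format are placeholders, not a recommendation):

    # Sketch of the first metric: treat "fraction of red text above some
    # threshold" as the system's prediction that a page is low quality, and
    # compare it against a human low-quality label.
    def page_precision_recall(pages, red_threshold=0.3):
        """pages: list of (red_fraction, is_low_quality) pairs."""
        predicted = {i for i, (red, _) in enumerate(pages) if red >= red_threshold}
        actual = {i for i, (_, low) in enumerate(pages) if low}
        hits = predicted & actual
        precision = len(hits) / len(predicted) if predicted else 0.0
        recall = len(hits) / len(actual) if actual else 0.0
        return precision, recall

The productivity metric (edits per hour with the coloured view vs. without)
would of course need actual reviewers rather than a computation like this.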
That's it for now. Like I said, I'm sure these ideas altogether amount to a lot of
work, and I don't expect your team to do it all. I just offer them as a list of
things that might be interesting for you guys to look at.
Cheers,
Alain