I would like to encourage you to do such a study. We can also provide data.
In a sense, an independent study would be even better.
The project currently has me (30% time; I also do other research, teach,
work on university committees, etc.), Ian (full time, except that he is taking
classes, as he should), and Bo (20% time, as he is also working on other
projects).
We have to be very careful in prioritizing things to do.
Also, we now have a tiny bit of funding, and while this enables me to fund
students and pay for machines, it also means that I cannot do a user study
on the spur of the moment -- I need the approval of the Ethics Board of my
university, and to get that, I need to apply, talk to them, etc. -- it all
takes time.
Finally, my experience is that a user study is hardly ever simple. First
you start, then you realize that you are asking the wrong questions, then
you redo it, then you figure there is too much noise in the data, then you
redo it, then you realize the data analysis does not quite work because what
you really needed were these other data.... and the data analysis is also
not simple: how to sample pages, how to sample text...
But as I say, I think it would be great if you did it, and we could provide
you the data you need.
On Dec 21, 2007 3:04 AM, Desilets, Alain <Alain.Desilets(a)nrc-cnrc.gc.ca> wrote:
More randomly ordered thoughts on this. I'm sure
they altogether amount to
a lot of work, and I don't expect your team to do it all. I just offer it as
a list of things that might be interesting for you guys to look at.
Earlier, you said you didn't have the resources to conduct a study with
human subjects. I just wanted to point out again that you may be
overestimating the time it takes to put one of those together. Putting
together a web site where people can go and volunteer to evaluate the
results of your algorithm on a page they know well would probably require a
fraction of the time it took your team to develop the algorithm itself (I'm
sure dealing with that much data, and figuring out who wrote what contiguous
parts of text took a lot of tinkering). I'm sure it would not be hard to
convince editors and reviewers on Wikipedia to volunteer to review 30-50
pages through such a special site. You could set up the experiment so that
the reviewer reviews a page WITHOUT any of your colourings, and then you
compute the overlap between their changes, and the segments that your system
thought were untrustworthy. By doing it that way, you would be avoiding the
issue of favourable or unfavourable evaluator bias towards the system
(because the evaluator does not know which segments the system deems
unreliable). Also, you would be catching both false positives and false
negatives (whereas the way I evaluated the system, I could only catch false
positives).
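To make the overlap measurement concrete, here is a rough Python sketch of
what I have in mind; the (start, end) character-range representation and all
the names are invented for illustration, not taken from your actual system:

    # Rough sketch: measure how much the segments the system flagged as
    # untrustworthy overlap with the parts of the page the reviewer edited.
    # Both are represented as (start, end) character offsets into the page
    # text as shown to the reviewer; names and format are illustrative only.

    def covered_chars(ranges):
        # Total characters covered by a list of (start, end) ranges
        # (assumes the ranges within one list do not overlap each other).
        return sum(end - start for start, end in ranges)

    def overlap_chars(ranges_a, ranges_b):
        # Characters that fall inside some range of both lists.
        total = 0
        for a_start, a_end in ranges_a:
            for b_start, b_end in ranges_b:
                total += max(0, min(a_end, b_end) - max(a_start, b_start))
        return total

    def agreement(flagged, edited):
        # Precision: fraction of flagged text the reviewer actually changed.
        # Recall: fraction of the reviewer's changes that had been flagged.
        shared = overlap_chars(flagged, edited)
        precision = shared / covered_chars(flagged) if flagged else 0.0
        recall = shared / covered_chars(edited) if edited else 0.0
        return precision, recall

    # Example: system flagged chars 100-150 and 400-450; reviewer touched
    # chars 120-160 and 800-820.
    print(agreement([(100, 150), (400, 450)], [(120, 160), (800, 820)]))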
Another thought is that maybe you should not evaluate the system's ability
to rate trustworthiness of **segments**, but rather rate the trustworthiness
of whole pages. In other words, it could be that if you focus the user's
attention on pages that have a large proportion of red in them, you would
have very few false positives on that task (of course you might have lots of
false negatives too, but it's still better than what we have now which is
NOTHING). For a task like that, you would of course have to compare your
system to a naïve implementation which, for example, uses a page's "age"
(i.e., elapsed time since initial creation), or the number of edits by
different people, or the number of visits by different people, as an
indication of trustworthiness. Have you looked at how your measure
correlates with the review board's evaluation of page quality?
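Just to illustrate the kind of comparison I mean, something like the
following would do (purely made-up numbers, and it assumes SciPy's spearmanr
for the rank correlation; any other correlation measure would work just as
well):

    # Sketch of the page-level comparison: correlate the proportion of "red"
    # (low-trust) text on each page, and a naive baseline such as page age,
    # with an independent quality rating. All numbers below are invented.
    from scipy.stats import spearmanr

    pages = [
        # (fraction of text coloured red, page age in days, quality rating 1-5)
        (0.02, 900, 5),
        (0.10, 400, 4),
        (0.35,  60, 2),
        (0.50,  10, 1),
    ]

    red_fraction = [p[0] for p in pages]
    page_age     = [p[1] for p in pages]
    quality      = [p[2] for p in pages]

    rho_red, _ = spearmanr(red_fraction, quality)
    rho_age, _ = spearmanr(page_age, quality)
    print("red fraction vs quality: rho = %.2f" % rho_red)
    print("page age vs quality:     rho = %.2f" % rho_age)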
Earlier, you also said you didn't think the algorithm could do a good job
at predicting what parts of the text are questionable because so many good
contributions are made by occasional one-off contributors and anonymous
authors. Maybe all this means is that you need to put your threshold for
colouring at a higher value. In other words, only colour those parts which
have been written by people who are KNOWN to be poor contributors. Also, for
anonymous contributors, do you treat all of them as one big "user", or do
you try to distinguish by IP address? Have you tried eliminating anonymous
contributions from your processing altogether? Have you tried eliminating
contributors who only made contributions to < N pages? How do these things
affect the values of "internal" metrics you mentioned in your previous
email?
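Here is a rough sketch of what those filtering experiments might look like;
the Revision structure and the thresholds are invented stand-ins for whatever
your processing pipeline actually uses:

    # Sketch of the filtering experiments: drop anonymous revisions and/or
    # authors who touched fewer than N distinct pages before running the
    # trust computation. The Revision structure here is made up.
    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class Revision:
        page_id: int
        author: str        # username, or IP address string if anonymous
        is_anonymous: bool

    def filter_revisions(revisions, drop_anonymous=True, min_pages=2):
        # Count how many distinct pages each author has edited.
        pages_per_author = defaultdict(set)
        for r in revisions:
            pages_per_author[r.author].add(r.page_id)

        kept = []
        for r in revisions:
            if drop_anonymous and r.is_anonymous:
                continue
            if len(pages_per_author[r.author]) < min_pages:
                continue
            kept.append(r)
        return kept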
Finally, it may be that this tool is more useful for reviewers and editors
than for readers of wikipedia. So, what would be good metrics for reviewers?
* Precision/Recall of pages that are low quality.
* Precision/Recall of segments in those low quality pages that are low
quality.
* Productivity boost when reviewing pages using this system vs not. For
example, does a reviewer using this system end up doing more edits per hour
than a reviewer who does not?
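The first two of these could be computed directly once you have a
human-labelled gold set; a minimal sketch (page names are placeholders):

    # Sketch of the first two metrics: precision and recall of the pages
    # (or segments) the system flags as low quality, against a human-labelled
    # gold set. Page names below are placeholders.

    def precision_recall(flagged, gold):
        # flagged: set of items the system marked low quality
        # gold: set of items humans judged low quality
        true_positives = len(flagged & gold)
        precision = true_positives / len(flagged) if flagged else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        return precision, recall

    print(precision_recall({"Page_A", "Page_B", "Page_C"},
                           {"Page_B", "Page_C", "Page_D"}))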
That's it for now. Like I said, I'm sure they altogether amount to a lot
of work, and I don't expect your team to do it all. I just offer it as a
list of things that might be interesting for you guys to look at.