I understand that your time is limited and that an evaluation
with human subjects may not be what you want to do.
I unfortunately do not have time to do such a study either, as I
am quite overcommitted myself with 6 projects on the go at once.
Cheers,
Alain
From: luca.de.alfaro@gmail.com
[mailto:luca.de.alfaro@gmail.com] On Behalf Of Luca de Alfaro
Sent: December 21, 2007 3:12 PM
To: Desilets, Alain
Cc: Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] Wikipedia colored according to trust
Dear Alain,
I would like to encourage you to do such a study. We can also provide
data for you.
In a sense, an independent study would be even better.
The project currently has me (30% time; I also do other research, teach, and serve on
university committees), Ian (full time, except that he is taking classes, as he should),
and Bo (20% time, as he is also working on other things).
We have to be very careful in prioritizing things to do.
Also, we now have a tiny bit of funding, and while this enables me to fund
students and pay for machines, it also means that I cannot do a user study on the
spur of the moment -- I need the approval of my university's Ethics Board, and to get
that I need to apply, talk to them, etc. -- it all takes time.
Finally, my experience is that a user study is hardly ever simple. First
you start, then you realize that you are asking the wrong questions, so you
redo it; then you figure there is too much noise in the data, so you
redo it again; then you realize the data analysis does not quite work because what
you really needed was other data... and the data analysis itself is also not
simple: how to sample pages, how to sample text...
But as I say, I think it would be great if you did it, and we could provide you
the data you need.
Luca
On Dec 21, 2007 3:04 AM, Desilets, Alain <Alain.Desilets@nrc-cnrc.gc.ca>
wrote:
More randomly ordered thoughts on this. I'm sure they
altogether amount to a lot of work, and I don't expect your team to do it all.
I just offer them as a list of things that might be interesting for you guys to
look at.
Earlier, you said you didn't have the resources to conduct a study with human
subjects. I just wanted to point out again that you may be overestimating the
time it takes to put one of those together. Putting together a web site where
people can volunteer to evaluate the results of your algorithm on a page
they know well would probably require a fraction of the time it took your team
to develop the algorithm itself (I'm sure dealing with that much data, and
figuring out who wrote which contiguous parts of the text, took a lot of tinkering).
I'm sure it would not be hard to convince editors and reviewers on Wikipedia to
volunteer to review 30-50 pages through such a special site. You could set up
the experiment so that the reviewer reviews a page WITHOUT any of your
colourings, and then you compute the overlap between their changes and the
segments that your system thought were untrustworthy. By doing it that way, you
would avoid the issue of favourable or unfavourable evaluator bias
towards the system (because the evaluator does not know which segments the
system deems unreliable). You would also catch both false positives and
false negatives (whereas the way I evaluated the system, I could only catch
false positives).
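To make the overlap idea concrete, here is a rough sketch (Python, with made-up
inputs) of the comparison I have in mind, assuming both the reviewer's changes
and your colouring can be reduced to sets of word positions in the same revision
of the page:

    # Rough sketch, not your actual pipeline: compare the words the reviewer
    # changed against the words the system coloured as untrustworthy. Both
    # inputs are assumed to be sets of word indices in the same revision.
    def overlap_metrics(flagged_words, edited_words):
        """Return precision and recall of the flagged words w.r.t. the reviewer's edits."""
        hits = flagged_words & edited_words
        precision = len(hits) / len(flagged_words) if flagged_words else 0.0
        recall = len(hits) / len(edited_words) if edited_words else 0.0
        return precision, recall

    # Toy example: system coloured words 10-19, reviewer edited words 15-24.
    p, r = overlap_metrics(set(range(10, 20)), set(range(15, 25)))
    print("precision=%.2f recall=%.2f" % (p, r))  # precision=0.50 recall=0.50

A low recall in this comparison would expose the false negatives (questionable
text the system missed) that my own informal evaluation could not catch.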
Another thought is that maybe you should not evaluate the system's ability to
rate the trustworthiness of **segments**, but rather its ability to rate the
trustworthiness of whole pages. In other words, it could be that if you focus the
user's attention on pages that have a large proportion of red in them, you would
have very few false positives on that task (of course you might have lots of
false negatives too, but that's still better than what we have now, which is
NOTHING). For a task like that, you would of course have to compare your system
to a naïve implementation which, for example, uses a page's "age" (i.e.
elapsed time since initial creation), the number of edits by different
people, or the number of visits by different people as an indication of
trustworthiness. Have you looked at how your measure correlates with the review
board's evaluation of page quality?
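As a rough illustration of that comparison (the numbers and field names below
are made up, purely to show the shape of the analysis), one could rank pages
both by their fraction of red text and by a naïve baseline such as page age,
and see which ranking correlates better with an external quality rating:

    # Rough sketch with invented toy numbers: correlate a trust-based page
    # score and a naive baseline (page age) with an external quality rating,
    # e.g. a review board's assessment on a 1-5 scale.
    from scipy.stats import spearmanr

    # (fraction of page coloured red, page age in days, quality rating 1-5)
    pages = [(0.02, 1500, 5), (0.10, 400, 4), (0.35, 60, 2),
             (0.50, 10, 1), (0.05, 900, 4)]
    redness = [p[0] for p in pages]
    age = [p[1] for p in pages]
    quality = [p[2] for p in pages]

    print("redness vs quality:", spearmanr(redness, quality).correlation)
    print("age     vs quality:", spearmanr(age, quality).correlation)

If the redness-based ranking does not beat the naïve baselines, that would be
important to know before investing in segment-level evaluation.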
Earlier, you also said you didn't think the algorithm could do a good job of
predicting which parts of the text are questionable, because so many good
contributions are made by occasional one-off contributors and anonymous
authors. Maybe all this means is that you need to set your threshold for
colouring at a higher value; in other words, only colour those parts which have
been written by people who are KNOWN to be poor contributors. Also, for
anonymous contributors, do you treat all of them as one big "user",
or do you try to distinguish them by IP address? Have you tried eliminating
anonymous contributions from your processing altogether? Have you tried
eliminating contributors who made contributions to fewer than N pages? How do
these things affect the values of the "internal" metrics you mentioned in
your previous email?
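If it helps to picture the filtering, here is a small sketch of the kind of
pre-processing I mean (the revision format is made up; substitute whatever your
pipeline actually uses):

    # Small sketch of the two filters above, over a made-up revision format:
    # each revision is a dict with the page title, the author (a registered
    # user name or an IP string), and a flag marking anonymous edits.
    from collections import defaultdict

    def filter_revisions(revisions, drop_anonymous=True, min_pages=2):
        """Drop anonymous edits and edits by authors who touched fewer than
        min_pages distinct pages; return the remaining revisions."""
        pages_per_author = defaultdict(set)
        for rev in revisions:
            pages_per_author[rev["author"]].add(rev["page"])
        return [rev for rev in revisions
                if not (drop_anonymous and rev["anonymous"])
                and len(pages_per_author[rev["author"]]) >= min_pages]

Rerunning your internal metrics on the filtered history, for a few values of
min_pages, would show how sensitive the results are to these choices.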
Finally, it may be that this tool is more useful for reviewers and editors than
for readers of Wikipedia. So, what would be good metrics for reviewers?
* Precision/recall of pages that are low quality (see the sketch after this list).
* Precision/recall of segments, within those low-quality pages, that are themselves low quality.
* Productivity boost when reviewing pages with this system vs. without it. For
example, does a reviewer using this system end up doing more edits per hour
than a reviewer who does not?
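For the first metric, the computation could be as simple as this sketch (the
threshold and data format are placeholders, not a recommendation):

    # Sketch of the first metric: treat "fraction of red text above some
    # threshold" as the system's prediction that a page is low quality, and
    # compare it against a human low-quality label.
    def page_precision_recall(pages, red_threshold=0.3):
        """pages: list of (red_fraction, is_low_quality) pairs."""
        predicted = {i for i, (red, _) in enumerate(pages) if red >= red_threshold}
        actual = {i for i, (_, low) in enumerate(pages) if low}
        hits = predicted & actual
        precision = len(hits) / len(predicted) if predicted else 0.0
        recall = len(hits) / len(actual) if actual else 0.0
        return precision, recall

The productivity metric (edits per hour with the coloured view vs. without)
would of course need actual reviewers rather than a computation like this.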
That's it for now. Like I said, I'm sure these ideas altogether amount to a lot of
work, and I don't expect your team to do it all. I just offer them as a list of
things that might be interesting for you guys to look at.
Cheers,
Alain