It seems that what Ward and others are getting at is that it would be useful to have precision and recall measures for Luca's trust metric. Of course, the metric can't possibly know when a brand-new user contributes unusually high-quality text to the encyclopedia. Nonetheless, a tool such as Amazon's Mechanical Turk could let us measure, by random sampling, how often false positives and false negatives occur. Although your hammer was not designed for their nail, I imagine it would do quite well.
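
To make that concrete, here is a minimal sketch of how precision and recall could be estimated from such a sample. The field names, and the count of items the tool missed, are assumptions for illustration, not measurements:

    # Sketch: estimating precision/recall of the coloring from a random
    # sample of human judgments (e.g. from Mechanical Turk). Field names
    # are hypothetical: 'colored' means the tool flagged the fragment
    # orange, 'questionable' is the majority human label.
    def precision_recall(sample):
        tp = sum(1 for f in sample if f["colored"] and f["questionable"])
        fp = sum(1 for f in sample if f["colored"] and not f["questionable"])
        fn = sum(1 for f in sample if not f["colored"] and f["questionable"])
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    # Illustration with Alain's Agile page below: 13 colored items, 5 of
    # them questionable, plus an assumed 2 questionable items the tool
    # missed (that last number is invented for the example).
    sample = ([{"colored": True,  "questionable": True}]  * 5
              + [{"colored": True,  "questionable": False}] * 8
              + [{"colored": False, "questionable": True}]  * 2)
    print(precision_recall(sample))  # precision ~0.38, recall ~0.71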


On Dec 20, 2007 9:04 AM, Luca de Alfaro <luca@soe.ucsc.edu> wrote:
You are evaluating the coloring against a performance criterion that is not the one we designed it for.

Our coloring gives an orange color to new information added by low-reputation authors; new information added by high-reputation authors is light orange. As the information is revised, it gains trust.

Thus, our coloring intuitively answers the questions: has this information been revised already? Have reputable authors looked at it?
 
You are asking a different question: how much of the information colored orange is questionable?
We will never be able to answer that well, for the simple reason that, as is well known, a lot of the correct factual information on Wikipedia comes from occasional contributors, including anonymous authors, and those occasional and anonymous contributors will have low reputation in most conceivable reputation systems.

We do not plan to do any large-scale human study. For one, we don't have the resources. For another, in the very limited tests we did, the notion of "questionable" was so subjective that our data contained a HUGE amount of noise. We ranked edits as -1 (bad), 0 (neutral), +1 (good), and the probability that two of us agreed on a ranking was somewhere below 60%. We decided this was not a good way to go.
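
To make the noise concrete: raw agreement overstates reliability, since some agreement happens by chance, and a chance-corrected statistic such as Cohen's kappa makes this visible. A minimal sketch with made-up ratings on the -1/0/+1 scale (not the actual study data):

    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        # Chance-corrected agreement for two raters over the same items.
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        counts_a, counts_b = Counter(rater_a), Counter(rater_b)
        labels = set(rater_a) | set(rater_b)
        expected = sum(counts_a[l] * counts_b[l] for l in labels) / n**2
        return (observed - expected) / (1 - expected)

    # Illustrative ratings only: raw agreement here is 0.6, but kappa
    # comes out near 0.38 once chance agreement is subtracted.
    a = [-1, 0, +1, +1, 0, -1, 0, +1, 0, 0]
    b = [ 0, 0, +1, -1, 0, -1, +1, +1, 0, -1]
    print(cohens_kappa(a, b))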

The results of our data-driven evaluation on a random sample of 1000 articles with at least 200 revisions each showed that (quoting from our paper):
Luca


On Dec 20, 2007 6:12 AM, Desilets, Alain <Alain.Desilets@nrc-cnrc.gc.ca> wrote:

Here is my feedback based on looking at a few pages on topics that I know very well.

 

Agile Software Development
- http://wiki-trust.cse.ucsc.edu/index.php/Agile_software_development
- Not bad. I counted 13 highlighted items, 5 of which I would say are questionable.

Usability
- http://wiki-trust.cse.ucsc.edu/index.php/Usability
- Not as good: 14 highlighted items, 3 of which I would say are questionable.

Open Source Software
- http://wiki-trust.cse.ucsc.edu/index.php/Open_source_software
- Not so good either: 23 highlighted items, 3 of which I would say are questionable.

 

This is a very small sample, but it's all I had time to do. It will be interesting to see how other people rate the precision of the highlights on a wider set of topics. Based on these three examples, it's not entirely clear to me that this system would help me identify questionable items in topics I am less familiar with.

 

Are you planning to do a larger-scale evaluation with human judges? One issue in that kind of study is avoiding favourable or unfavourable bias on the part of the judges. You also have to make sure that your algorithm does better than random guessing (in other words, there may be so many questionable phrases in a wiki page that random guessing would be bound to guess right once out of every, say, 5 times). One way to address both issues would be to produce pages where half of the highlights are produced by your system and the other half mark a randomly selected contiguous contribution by a single author, as in the sketch below.
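
Such a mixed evaluation set could be assembled along these lines. This is only a sketch: the function name, the "tokens" field, and the span lengths are assumptions for illustration, assuming each highlight and each revision carries a token list:

    import random

    def build_eval_set(system_highlights, revisions, k):
        # Mix k system highlights with k random contiguous single-author
        # spans, shuffled so judges cannot tell which is which.
        chosen = [dict(h, source="system")
                  for h in random.sample(system_highlights, k)]
        decoys = []
        for _ in range(k):
            rev = random.choice(revisions)     # one revision = one author
            start = random.randrange(len(rev["tokens"]))
            length = random.randint(3, 15)     # arbitrary span length
            decoys.append({"tokens": rev["tokens"][start:start + length],
                           "source": "random"})
        mixed = chosen + decoys
        random.shuffle(mixed)
        return mixed

The "source" label stays hidden from the judges and is only unblinded when scoring, so the system's hit rate can be compared directly against the random baseline on the same page.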

 

I think this is really interesting work worth doing, btw. I just don't know how useful it is in its current state.

 

Cheers,

 

Alain Désilets

 



_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wiki-research-l