Dear All,
we have a demo at http://wiki-trust.cse.ucsc.edu/ that features the whole English Wikipedia, as of its February 6, 2007 snapshot, colored according to text trust. This is the first time that even we can look at how the "trust coloring" looks on the whole of the Wikipedia! We would be very interested in feedback (the wikiquality-l@lists.wikimedia.org mailing list is the best place).
If you find bugs, you can report them to us at http://groups.google.com/group/wiki-trust
Happy Holidays!
Luca
PS: yes, we know, some images look off. It is currently fairly difficult for a site outside of the Wikipedia to fetch Wikipedia images correctly.
PPS: there are going to be a few planned power outages on our campus in the next few days, so if the demo is down, try again later.
Here is my feedback based on looking at a few pages on topics that I know very well.
Agile Software Development
· http://wiki-trust.cse.ucsc.edu/index.php/Agile_software_development
· Not bad. I counted 13 highlighted items, 5 of which I would say are questionable.
Usability
· http://wiki-trust.cse.ucsc.edu/index.php/Usability
· Not as good. 14 highlighted items, 3 of which I would say are questionable.
Open Source Software
· http://wiki-trust.cse.ucsc.edu/index.php/Open_source_software
· Not so good either. 23 highlighted items, 3 of which I would say are questionable.
This is a very small sample, but it's all I have time to do. It will be interesting to see how other people rate the precision of the highlightings on a wider set of topics. Based on these three examples, it's not entirely clear to me that this system would help me identify questionable items in topics that I am not so familiar with.
Are you planning to do a larger scale evaluation with human judges? An issue in that kind of study is to avoid favourable or unfavourable bias on the part of the judges. Also, you have to make sure that your algorithm is doing better than random guessing (in other words, there may be so many questionable phrases in a wiki page that random guessing would be bound to guess right once out of every, say, 5 times). One way to avoid these issues would be to produce pages where half of the highlightings are produced by your system, and the other half highlight a randomly selected contiguous contribution by a single author.
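To be concrete, the kind of blinded setup I have in mind could be put together along these lines (just a rough sketch; the data structures and names are made up for illustration):

    import random

    def build_blinded_page(system_spans, contributions, seed=0):
        """Mix system highlights with an equal number of random-baseline highlights.

        system_spans:  list of (start, end) offsets flagged as low-trust by the system.
        contributions: list of (author, start, end) contiguous single-author contributions.
        """
        rng = random.Random(seed)
        baseline = [(s, e) for (_, s, e) in rng.sample(contributions, len(system_spans))]
        mixed = [(span, "system") for span in system_spans] + \
                [(span, "random") for span in baseline]
        rng.shuffle(mixed)  # judges see highlights without knowing which source produced them
        return mixed

    def precision_by_source(judgements):
        """judgements: list of (source, judged_questionable) pairs collected from the judges."""
        for source in ("system", "random"):
            marks = [q for s, q in judgements if s == source]
            if marks:
                print(source, sum(marks) / len(marks))  # fraction judged questionable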
I think this is really interesting work worth doing, btw. I just don't know how useful it is in its current state.
Cheers,
Alain Désilets
Alain -- Is it true that although you've seen 3x to 15x false positives, you did not see any false negatives? By false negative I would mean a questionable item that was not highlighted. Maybe you weren't looking for these? Best regards. -- Ward
__________________ Ward Cunningham 503-432-5682
Good point.
I did not look for false negatives because of lack of time. That would have required me to read the whole content of the pages, and look for items that I thought were questionable even though they weren't highlighted by the system.
I agree with you that false negatives may be just as important as false positives. It's just more work to evaluate that metric.
Alain
You are evaluating the coloring against a performance criterion that is not the one we designed it for.
Our coloring gives an orange color to new information that has been added by low-reputation authors. New information by high-reputation authors is light orange. As the information is revised, it gains trust.
Thus, our coloring answers the question, intuitively: has this information been revised already? Have reputable authors looked at it?
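In rough pseudocode, the intuition is something like the following (a simplified sketch only; the constants and the exact update rule here are illustrative, not the actual algorithm):

    def initial_trust(author_reputation, max_reputation):
        # New text starts with a trust proportional to its author's reputation:
        # low-reputation authors produce strongly orange (low-trust) text,
        # high-reputation authors produce light orange (higher-trust) text.
        return author_reputation / max_reputation

    def revise_trust(word_trust, reviser_reputation, max_reputation, gain=0.3):
        # Text that survives a revision by a reputable author gains trust,
        # moving part of the way toward that reviser's own reputation level.
        target = reviser_reputation / max_reputation
        return word_trust + gain * max(0.0, target - word_trust)

    def shade(word_trust):
        # Map trust to a display shade: the lower the trust, the deeper the orange.
        return "white" if word_trust > 0.9 else "orange(%.2f)" % (1.0 - word_trust)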
You are asking the question: how much of the information colored orange is questionable? This is a different question, and one we will never be able to do well on, for the simple reason that it is well known that a lot of the correct factual information on Wikipedia comes from occasional contributors, including anonymous authors, and those occasional and anonymous contributors will have low reputation in most conceivable reputation systems.
We do not plan to do any large-scale human study. For one, we don't have the resources. For another, in the very limited tests we did, the notion of "questionable" was so subjective that our data contained a HUGE amount of noise. The task was to rank edits as -1 (bad), 0 (neutral), +1 (good). The probability that two of us agreed was somewhere below 60%. We decided this was not a good way to go.
The results of our data-driven evaluation on a random sample of 1000 articles with at least 200 revisions each showed that (quoting from our paper):
- Recall of deletions. We consider the recall of low-trust as a predictor for deletions. We show that text in the lowest 50% of trust values constitutes only 3.4% of the text of articles, yet corresponds to 66% of the text that is deleted from one revision to the next.
- Precision of deletions. We consider the precision of low-trust as a predictor for deletions. We show that text that is in the bottom half of trust values has a probability of 33% of being deleted in the very next revision, in contrast with the 1.9% probability for general text. The deletion probability rises to 62% for text in the bottom 20% of trust values.
- Trust of average vs. deleted text. We consider the trust distribution of all text, compared to the trust distribution of the text that is deleted. We show that 90% of the text overall had trust of at least 76%, while the average trust for deleted text was 33%.
- Trust as a predictor of lifespan. We select words uniformly at random, and we consider the statistical correlation between the trust of the word at the moment of sampling, and the future lifespan of the word. We show that words with the highest trust have an expected future lifespan that is 4.5 times longer than words with no trust. We remark that this is a proper test, since the trust at the time of sampling depends only on the history of the word prior to sampling.
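To give an idea of what is being measured, the deletion figures correspond to a computation roughly like this (a simplified sketch over sampled words, not the exact analysis in the paper):

    def deletion_recall_precision(words, low_trust_cutoff=0.5):
        """words: list of (trust, deleted_in_next_revision) pairs sampled from article histories.

        Returns (recall, precision) of "low trust" as a predictor of deletion.
        """
        low = [(t, d) for t, d in words if t <= low_trust_cutoff]
        deleted_total = sum(1 for _, d in words if d)
        deleted_low = sum(1 for _, d in low if d)
        recall = deleted_low / deleted_total if deleted_total else 0.0
        precision = deleted_low / len(low) if low else 0.0
        return recall, precision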
Luca
It seems that what Ward and others are getting at is that it would be useful to have precision and recall measures for Luca's trust metric. Of course, the metric can't possibly know it when a brand new user contributes unusually high quality text to the encyclopedia. Nonetheless, it seems that a tool such as Amazon's Mechanical Turk could allow us to easily measure how often false positives and false negatives occur using random sampling. Although your hammer was not designed for their nail, I imagine it would do quite well.
On Dec 20, 2007 9:04 AM, Luca de Alfaro luca@soe.ucsc.edu wrote:
You are evaluating the coloring against a performance criterion that is not the one we designed it for.
Our coloring gives an orange color to new information that has been added by low-reputation authors. New information by high-reputation authors is light orange. As the information is revised, it gains trust.
Thus, our coloring answers the question, intuitively: has this information been revised already? Have reputable authors looked at it? You are asking the question: how much of the information colored orange is questionable? This is a different question, and one we will never be able to do well on, for the simple reason that it is well known that a lot of the correct factual information on Wikipedia comes from occasional contributors, including anonymous authors, and those occasional and anonymous contributors will have low reputation in most conceivable reputation systems.
Before I go further, let me reiterate that I think your work is excellent and has the potential for adding huge value to the wiki world. If I didn't think so, I wouldn't bother writing this message.
I think it's important to evaluate a system like this in terms of a metric that captures some sort of value added to some category of wiki end user.
The system you are trying to build could provide HUGE value for the end user, if it could allow him to tell with a certain amount of certainty (say, > 60%) which parts of a page are questionable and which parts are not. This is the metric I used in my admittedly very small test (Note: I'm sure it's not the only metric that could be used to measure end-user value).
Based on that very preliminary test, it seems your system does not do a great job at that, and you seem to say that you don't think it could.
That's OK. I'm sure there is SOMETHING that this system can do for the end user, because the "internal" performance metrics you list in your message seem to indicate that there is some substance to the predictions of the algorithm.
We do not plan to do any large-scale human study. For one, we don't have the resources.
A study with human judges does not have to be large scale. I would guess 30 subjects would do the trick.
For another, in the very limited tests we did, the notion of "questionable" was so subjective that our data contained a HUGE amount of noise. The task was to rank edits as -1 (bad), 0 (neutral), +1 (good). The probability that two of us agreed was somewhere below 60%. We decided this was not a good way to go.
That's interesting. I would have expected a large amount of agreement based on my assumption that the majority of edits are either clearly Good or Neutral. In other words, I would have expected judges to disagree only on the "iffy" portion of the edits, but since I assume that this is a small portion of all edits, you would still have large agreement. I guess my assumptions are wrong.
Is the story the same if you look at only two categories: Reject (= your {-1} set) and Keep (your {0, +1} set)?
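For instance, the comparison I have in mind is just something like this (a throwaway sketch; the judge names and scores below are invented):

    from itertools import combinations

    def pairwise_agreement(ratings, collapse=False):
        """ratings: {judge: [score per edit]}, with scores in {-1, 0, +1}.

        With collapse=True, scores are first mapped to Reject (-1) vs. Keep (0 or +1).
        """
        agree = total = 0
        for a, b in combinations(ratings, 2):
            for x, y in zip(ratings[a], ratings[b]):
                if collapse:
                    x, y = (x >= 0), (y >= 0)
                agree += (x == y)
                total += 1
        return agree / total if total else 0.0

    # Example with invented scores for two judges over five edits:
    print(pairwise_agreement({"j1": [-1, 0, 1, 1, 0], "j2": [0, 0, 1, -1, 1]}))
    print(pairwise_agreement({"j1": [-1, 0, 1, 1, 0], "j2": [0, 0, 1, -1, 1]}, collapse=True))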
The results of our data-driven evaluation on a random sample of 1000 articles with at least 200 revisions each showed that (quoting from our paper):
- Recall of deletions. We consider the recall of low-trust as a predictor for deletions. We show that text in the lowest 50% of trust values constitutes only 3.4% of the text of articles, yet corresponds to 66% of the text that is deleted from one revision to the next.
- Precision of deletions. We consider the precision of low-trust as a predictor for deletions. We show that text that is in the bottom half of trust values has a probability of 33% of being deleted in the very next revision, in contrast with the 1.9% probability for general text. The deletion probability rises to 62% for text in the bottom 20% of trust values.
- Trust of average vs. deleted text. We consider the trust distribution of all text, compared to the trust distribution of the text that is deleted. We show that 90% of the text overall had trust of at least 76%, while the average trust for deleted text was 33%.
- Trust as a predictor of lifespan. We select words uniformly at random, and we consider the statistical correlation between the trust of the word at the moment of sampling, and the future lifespan of the word. We show that words with the highest trust have an expected future lifespan that is 4.5 times longer than words with no trust. We remark that this is a proper test, since the trust at the time of sampling depends only on the history of the word prior to sampling.
Those measures tell me that there is definitely something to the algorithm, and I am trying to help you figure out what value it could provide, and to which kind of end user.
One concern I have though. Have you compared your system to a naïve implementation which simply uses the edit's "age" as a measure of its trustworthiness? In other words, don't worry about who created the edit or modified it. Just worry about how long it's been there.
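Such a baseline could be as simple as something like this (a sketch; the half-life constant is an arbitrary placeholder):

    import time

    def age_based_trust(insertion_time, now=None, half_life_days=30.0):
        # Naive baseline: trust grows with how long the text has survived,
        # ignoring who wrote it or who revised it since.
        now = time.time() if now is None else now
        age_days = (now - insertion_time) / 86400.0
        return age_days / (age_days + half_life_days)  # 0 for brand-new text, tends to 1 as it ages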
Alain
More randomly ordered thoughts on this. I'm sure they altogether amount to a lot of work, and I don't expect your team to do it all. I just offer it as a list of things that might be interesting for you guys to look at.
Earlier, you said you didn't have the resources to conduct a study with human subjects. I just wanted to point out again that you may be overestimating the time it takes to put one of those together. Putting together a web site where people can go and volunteer to evaluate the results of your algorithm on a page they know well would probably require a fraction of the time it took your team to develop the algorithm itself (I'm sure dealing with that much data, and figuring out who wrote what contiguous parts of text, took a lot of tinkering). I'm sure it would not be hard to convince editors and reviewers on Wikipedia to volunteer to review 30-50 pages through such a special site. You could set up the experiment so that the reviewer reviews a page WITHOUT any of your colourings, and then you compute the overlap between their changes and the segments that your system thought were untrustworthy. By doing it that way, you would be avoiding the issue of favourable or unfavourable evaluator bias towards the system (because the evaluator does not know which segments the system deems unreliable). Also, you would be catching both false positives and false negatives (whereas the way I evaluated the system, I could only catch false positives).
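The overlap computation itself would be straightforward, something along these lines (a sketch over character spans; the span representation is just an assumption on my part):

    def overlap_with_reviewer_edits(low_trust_spans, edited_spans):
        """Both arguments are lists of (start, end) character offsets on the same revision.

        Returns the fraction of low-trust text that the blinded reviewer actually changed,
        i.e. a rough agreement between the coloring and an independent human judgement.
        """
        def intersection(a, b):
            total = 0
            for s1, e1 in a:
                for s2, e2 in b:
                    total += max(0, min(e1, e2) - max(s1, s2))
            return total

        flagged = sum(e - s for s, e in low_trust_spans)
        return intersection(low_trust_spans, edited_spans) / flagged if flagged else 0.0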
Another thought is that maybe you should not evaluate the system's ability to rate trustworthiness of **segments**, but rather rate the trustworthiness of whole pages. In other words, it could be that if you focus the user's attention on pages that have a large proportion of red in them, you would have very few false positives on that task (of course you might have lots of false negatives too, but it's still better than what we have now, which is NOTHING). For a task like that, you would of course have to compare your system to a naïve implementation which, for example, uses a page's "age" (i.e. elapsed time since initial creation), or the number of edits by different people, or the number of visits by different people, as an indication of trustworthiness. Have you looked at how your measure correlates with the review board's evaluation of page quality?
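Aggregating your word-level trust into a page-level flag could be as simple as this (again a sketch; the cutoff is an arbitrary placeholder):

    def page_low_trust_fraction(word_trusts, low_cutoff=0.5):
        # Fraction of a page's words whose trust falls below the cutoff; pages with
        # a high fraction would be the ones surfaced to reviewers first.
        low = sum(1 for t in word_trusts if t < low_cutoff)
        return low / len(word_trusts) if word_trusts else 0.0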
Earlier, you also said you didn't think the algorithm could do a good job at predicting what parts of the text are questionable, because so many good contributions are made by occasional one-off contributors and anonymous authors. Maybe all this means is that you need to put your threshold for colouring at a higher value. In other words, only colour those parts which have been written by people who are KNOWN to be poor contributors. Also, for anonymous contributors, do you treat all of them as one big "user", or do you try to distinguish by IP address? Have you tried eliminating anonymous contributions from your processing altogether? Have you tried eliminating contributors who only made contributions to < N pages? How do these things affect the values of the "internal" metrics you mentioned in your previous email?
Finally, it may be that this tool is more useful for reviewers and editors than for readers of Wikipedia. So, what would be good metrics for reviewers?
- Precision/Recall of pages that are low quality.
- Precision/Recall of segments in those low-quality pages that are low quality.
- Productivity boost when reviewing pages using this system vs. not. For example, does a reviewer using this system end up doing more edits per hour than a reviewer who does not?
That's it for now. Like I said, I'm sure they altogether amount to a lot of work, and I don't expect your team to do it all. I just offer it as a list of things that might be interesting for you guys to look at.
Cheers,
Alain
Dear Alain,
I would like to encourage you to do such a study. We can also provide data for you. In a sense, an independent study would be even better.
The project currently has me (30% time; I also do other research, teach, work on university committees, etc.), Ian (full time, except that he is taking classes, as he should), and Bo (20% time, as he is also working on other things). We have to be very careful in prioritizing things to do. Also, we now have a tiny bit of funding, and while this enables me to fund students and pay for machines, it also means that I cannot do a user study on the spur of the moment - I need the approval of the Ethics Board of my university, and to get that, I need to apply, talk to them, etc. -- it all takes time. Finally, my experience is that a user study is hardly ever simple. First you start, then you realize that you are asking the wrong questions, then you redo it, then you figure there is too much noise in the data, then you redo it, then you realize the data analysis does not quite work because what you really needed were these other data... And the data analysis is also not simple: how to sample pages, how to sample text...
But as I say, I think it would be great if you did it, and we could provide you the data you need.
Luca
I understand that your time is limited and that an evaluation with human subjects may not be what you want to do.
I unfortunately do not have time to do such a study either, as I am quite overcommitted myself with 6 projects on the go at once.
Cheers,
Alain