Dear All,
My name is Avanidhar Chandrasekaran (http://en.wikipedia.org/wiki/User_talk:Avanidhar).
I work with GroupLens Research at the University of Minnesota, Twin Cities. As part of my research, I am analyzing the usefulness and necessity of author reputation in Wikipedia.
To this end, I have set up a prototype interface that colors the words in an article based on their age.
Since you are experienced contributors to Wikipedia, I invite you to participate in this study, which involves the following.
1. Please visit the following instances of Wikipedia and evaluate the interface components that have been incorporated into each of them. Each uses its own algorithm to color text.
a) The Wikitrust project
http://wiki-trust.cse.ucsc.edu/index.php/Main_Page
b) The Wiki-reputation project at Grouplens research
http://wiki-reputation.cs.umn.edu/index.php/Main_Page
2. Once you have evaluated the two interfaces, kindly complete this survey on Wikipedia quality:
http://www.surveymonkey.com/s.aspx?sm=hagN5S1JZHxH6pF9SmXkkA_3d_3d
We hope to get your valuable feedback on these interfaces and how Wikipedia article quality can be improved.
Thanks for your time.
Avanidhar Chandrasekaran,
GroupLens Research, University of Minnesota
2008/11/19 avani@cs.umn.edu
We hope to get your valuable feedback on these interfaces and how Wikipedia article quality can be improved.
Quite interesting - the "age of words" color coding might be useful in detecting less obvious kinds of vandalism.
m
On Wednesday 19 November 2008, avani@cs.umn.edu wrote:
We hope to get your valuable feedback on these interfaces and how Wikipedia article quality can be improved.
This might bias other respondents, but I thought it was an interesting idea so I wanted to share it. I concluded with the following, which is no doubt affected by my being a WikiGnome:
[[ If I see an error, I fix it without much regard to time or author reputation. I do pay attention to and investigate author reputation on substantive issues on the discussion pages and it would be interesting to see a discussion thread colored according to reputation. ]]
On Wed, Nov 19, 2008 at 2:23 PM, avani@cs.umn.edu wrote:
We hope to get your valuable feedback on these interfaces and how Wikipedia article quality can be improved.
Given the older snapshots, I selected older articles that I had started: NuBus and ARCNET.
The "time based" system from UMN did not work at all; every search resulted in a "page not found" error.
The UCSC system did work, but gave me odd results. Apparently I have a very bad reputation, because when I look in the history at the first versions, which I wrote in their entirety, it colored them all yellow!
Newer versions of the same articles had much more white, even though huge portions of the text were still from the original. This may be due to diff problems -- I consider diff to be largely random in effectiveness; sometimes it works, but other times a single whitespace change, especially vertical, will make it think the entire article was edited.
My guess is that the system is tripping over diffs like this, and thus considering the article to have been re-written by another editor. Since this has happened, MY reputation goes down, or so I understand it.
I don't think this system could possibly work if it is based on the wiki's diffs. If it's going to work, it's going to need to use a much more reliable system.
Another problem I see with it is that it will rank an author whose contributions are 1000 unchanged comma inserts as being as reliable as an author who created a perfect 1000-character article (or perhaps rate the first even higher). There should be some sort of length bias: if an author makes a big edit, out of character, that's important to know.
Maury
On Sun, Nov 23, 2008 at 9:03 AM, Maury Markowitz maury.markowitz@gmail.com wrote:
On Wed, Nov 19, 2008 at 2:23 PM, avani@cs.umn.edu wrote:
We hope to get your valuable feedback on these interfaces and how Wikipedia article quality can be improved.
Given the older snapshots, I selected older articles that I had started: NuBus and ARCNET.
The "time based" system from UMN did not work at all; every search resulted in a "page not found" error.
The UMN system intentionally included only a small number (70?) of articles. This is why you needed to use the random page function to browse among them.
This doesn't reflect any shortcoming of the system; it most likely just reflects the limits of the computational resources they were working under.
[snip]
Newer versions of the same articles had much more white, even though huge portions of the text were still from the original. This may be due to diff problems -- I consider diff to be largely random in effectiveness; sometimes it works, but other times a single whitespace change, especially vertical, will make it think the entire article was edited.
Yes, I had exactly the same experience with the UCSC system: different coloring for text I'd added in the same edit that created the article. Quite inscrutable.
[snip]
Another problem I see with it is that it will rank an author whose contributions are 1000 unchanged comma inserts as being as reliable as an author who created a perfect 1000-character article (or perhaps rate the first even higher). There should be some sort of length bias: if an author makes a big edit, out of character, that's important to know.
For the articles it covered I found the UMN system to be more usable: its output was more explicable, and the signal-to-noise ratio was just better. This may be partially due to bugs in the UCSC history analysis, and to a different choice of coloring thresholds (UCSC seemed to color almost everything, removing the usefulness of color as something to draw my attention).
Even so, I'm distrustful of "reputation" as an automated metric. Reputation is a fuzzy thing (consider your comma example), but time is a straightforward metric which is much easier to get right. Your tireless and unreverted editing of external links tells me very little about your ability to make a reliable edit to the intro of an article... or at least very little that I didn't already know by merely knowing whether your account was brand new or not. (New accounts are more likely to be used by inexperienced and ill-motivated persons.)
I believe a metric applied correctly, consistently, and understandably is just going to be more useful than a metric which considers more data but is also subject to more noise. The differential performance between these two systems has done nothing but confirm my suspicions in this regard.
A simple, objective challenge for any predictive coloring system would be to use it in the following experimental procedure:
* Take a dump of Wikipedia up to a year old; use this as the underlying knowledge for the systems.
* Make several random selections of articles and include the newer revisions not included in the initial set, up to 6 months old. Call these the test sets.
* The predictive coloring system should then take each revision in a test set in time order and predict whether it will be reverted (within X time?).
* The actual edits up to now should be analyzed to determine which changes actually were reverted and when.
The final score will be the false positive and false negative rates. So long as we assume that the existing editing practices are not too bad, we should find that the best predictive coloring system would generally tend to minimize these rates.
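(For concreteness, here is a minimal sketch of the scoring step of such a procedure, assuming we already have one (predicted, actually reverted) pair per test revision; the function name and data layout are only illustrative, not part of either system.)

    # Hypothetical scorer: given (predicted_revert, actually_reverted) pairs,
    # one per revision in a test set, compute the two error rates.
    def score_predictions(pairs):
        tp = fp = tn = fn = 0
        for predicted, actual in pairs:
            if predicted and actual:
                tp += 1
            elif predicted and not actual:
                fp += 1
            elif actual:
                fn += 1
            else:
                tn += 1
        false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
        false_negative_rate = fn / (fn + tp) if (fn + tp) else 0.0
        return false_positive_rate, false_negative_rate

    # Example: two correctly flagged reverts, one false alarm, one missed revert.
    print(score_predictions([(True, True), (True, True),
                             (True, False), (False, True), (False, False)]))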
I agree with Gregory that it is very useful to quantify the usefulness of trust information on text -- otherwise, all comparisons are very subjective. In our WikiSym 08 paper, we measure various parameters of the "trust" coloring we compute, including:
- Recall of deletions. Only 3.4% of text is in the lower half of trust values, yet this is 66% of the text that is deleted in the very next revision.
- Precision of deletions. Text in the bottom half of trust values has a probability of 33% of being deleted in the next revision, against a probability of 1.9% for general text. The deletion probability rises to 62% for text in the bottom 20% of trust values.
- We study the correlation between the trust of a word, sampled at random in all revisions, and the future lifespan of the word (correcting for the finite-horizon effect due to the finite number of revisions in each article), showing positive correlation.
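(As a rough illustration of how recall and precision of deletions can be computed -- this is only a sketch under my own assumed data layout, not the code used for the paper: each word sample carries its trust value and a flag saying whether it was deleted in the very next revision.)

    # Sketch: recall and precision of deletions for "low-trust" words.
    # samples: list of (trust, deleted_in_next_revision) pairs, assumed non-empty.
    def deletion_recall_precision(samples, low_trust_cutoff):
        low_trust = [d for t, d in samples if t < low_trust_cutoff]
        deleted = [t for t, d in samples if d]
        # Recall: what fraction of the deleted text was already low-trust?
        recall = sum(1 for t in deleted if t < low_trust_cutoff) / len(deleted)
        # Precision: how likely is low-trust text to be deleted in the next revision?
        precision = sum(1 for d in low_trust if d) / len(low_trust)
        return recall, precision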
Some aspects are not captured by the above measures:
- We ensured that every "tampering" (including cut-and-paste) is reflected in the trust coloring, so it is hard to subvert the algorithm (does "age" provide this?).
- We ensured the whole scheme is robust with respect to attacks (see the various papers if you are interested).
I fully believe that it should not be hard to improve on our system with respect to the above measurements. And I fully agree that the "reputation" we compute is essentially an internal parameter of the system, and does not really constitute a good summary of a person's overall Wikipedia contribution; for this and other reasons we do not display it.
Luca
Maury,
perhaps I can help explain the behavior you saw in the UCSC system (I am one of the developers). New text is always somewhat orange, to signal to visitors that it has not yet been fully reviewed. The higher the reputation, the lighter the shade of orange, but orange it still is (I have no idea how high your computed reputation was when you started writing that article).
Text background becomes white when other people revise it without drastically changing it: this indicates consensus. In our more recent code version, we also have a "vote" button; using this, text can gain trust more speedily, without the need for many revisions to occur. In a live experiment, where people can click on the vote button, I presume the trust of the text would rise more rapidly. Note that the code prevents double voting, creating sock-puppet accounts to vote, etc.
So I don't think, based on what you say, that the system is tripping over diffs. It is simply considering new text less trusted, and more-revised text more trusted, which is what we wanted. It appears, however, that we don't do a very good job of describing the algorithm on the web site (I guess we put most of the description work into writing the papers... we will try to improve the web site).
We don't measure "edit work" in number of edits, but in number of words changed. As you say, for our system, changing 1000 words in separate edits is the same (provided the edits are all kept, i.e., not reverted) as providing a single 1000-word contribution. We thought of giving a larger reward to larger contributions: precisely, of making the reputation increment proportional to n^a, where n is the number of words and a > 1. This did not work well for Wikipedia, because it ended up not rewarding enough the work of the many editors who clean and polish the articles, thus making many small edits. Technically it would be trivial to change the code to include such a non-linear reward scheme (to adopt rewards proportional to n^a rather than n); whether it is desirable, I have no idea. It does not lead to better quantitative performance of the system, i.e., the resulting trust is not better at predicting future text deletions.
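(A toy version of the two reward schemes, just to make the n vs. n^a distinction concrete; the names and the quality factor q are my own shorthand, not the actual WikiTrust code.)

    # Toy reputation increment: q is a quality factor for the edit,
    # n the number of words contributed, a the exponent (a = 1 is linear).
    def reputation_increment(q, n, a=1.0):
        return q * (n ** a)

    # Linear scheme: ten kept 100-word edits earn as much as one kept
    # 1000-word edit; with a = 1.2 the single large edit earns more.
    print(10 * reputation_increment(1.0, 100), reputation_increment(1.0, 1000))
    print(10 * reputation_increment(1.0, 100, 1.2), reputation_increment(1.0, 1000, 1.2))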
Luca
How would the system handle a paragraph of high-quality, well-referenced and well-organised content contributed by an editor A that is thoroughly copyedited by an editor B? Would editor A be deemed less trustworthy when his prose is thoroughly copyedited?
Very good question. Author A would still get some reputation gain, due to the way we compute the edit distance. A would gain less reputation than if his contribution had survived intact.
Specifically, assume that there is a revision r0, A adds an n-word piece of text, making it become r1, and B then rewrites A's contribution, obtaining r2. Due to the way we compute edit distances, we have:
d(r0, r1) = n
d(r1, r2) = n/2
d(r0, r2) = n
So the quality of A's contribution is q = (d(r0,r2) - d(r1,r2)) / d(r0,r1) = (n - n/2) / n = 1/2 > 0
and A gets half of the reputation gain he would have gotten had his text not been rewritten. The reason d(r1, r2) = n/2 is that our algorithm distinguishes, when giving credit, replacements from insertions/removals. Note that people who insert pure spam have their contribution removed, not rewritten, so even with the above lenient treatment of replacements, spammers do not end up gaining reputation.
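(The same computation as a tiny sketch, with the distances from the example above plugged in; the function name is mine.)

    # Quality of A's contribution, as defined above:
    #   q = (d(r0, r2) - d(r1, r2)) / d(r0, r1)
    def contribution_quality(d_r0_r1, d_r1_r2, d_r0_r2):
        return (d_r0_r2 - d_r1_r2) / d_r0_r1

    n = 100  # A adds an n-word piece of text, later rewritten by B
    print(contribution_quality(n, n / 2, n))  # 0.5: half the usual reputation gain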
When developing the reputation algorithms, we (Bo and I) went over hundreds of revisions which were marked with negative quality, checking that the author really deserved the reputation loss they got by authoring that revision. We tweaked the algorithms as a consequence; the treatment of replacements described above was introduced precisely to give reputation to authors, even in the face of wordsmithing or copy-editing.
Luca
On Mon, Nov 24, 2008 at 7:59 PM, J.L.W.S. The Special One <hildanknight@gmail.com> wrote:
How would the system handle a paragraph full of high quality, well-referenced and well-organised content contributed by an editor A, that is thoroughly copyedited by an editor B? Would editor A be deemed less trustworthy when his prose is thoroughly copyedited?
On Mon, Nov 24, 2008 at 8:35 PM, Luca de Alfaro luca@dealfaro.org wrote: [snip]
So I don't think based on what you say that the system is tripping over diffs.
For example: I can't figure out why the text in the image caption is colored here: http://wiki-trust.cse.ucsc.edu/index.php/Digital_room_correction
I couldn't initially figure out why *anything* above the external link section was colored… though the inability to diff contributed to that.
On Mon, Nov 24, 2008 at 8:22 PM, Luca de Alfaro luca@dealfaro.org wrote:
I agree with Gregory that it is very useful to quantify the usefulness of trust information on text -- otherwise, all comparisons are very subjective. In our WikiSym 08 paper, we measure various parameters of the "trust" coloring we compute, including:
- Recall of deletions. Only 3.4% of text is in the lower half of trust values, yet this is 66% of the text that is deleted in the very next revision.
- Precision of deletions. Text in the bottom half of trust values has a probability of 33% of being deleted in the next revision, against a probability of 1.9% for general text. The deletion probability rises to 62% for text in the bottom 20% of trust values.
- We study the correlation between the trust of a word, sampled at random in all revisions, and the future lifespan of the word (correcting for the finite-horizon effect due to the finite number of revisions in each article), showing positive correlation.
[snip]
These performance metrics are better than I would have guessed from browsing through the output. How does the color mapping reflect the trust values? Basically when I use it I see a *lot* of colored things which are perfectly fine. At least for me, the difference between shades is far less cognitively significant than colored vs non-colored, so that may be the source of my confusion.
Have you compared your system to a simple toy trust metric? I'd propose "revisions by users in their first week and before their first 7 (?) edits are untrusted". This reflects the existing automatic trust system on the site (auto-confirmation), and also reflects a type of trust checking applied manually by editors. I think that's the bar any more sophisticated trust metric needs to outperform.
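(A sketch of that toy baseline, assuming we know the author's account age and prior edit count at the time of each revision; the thresholds are the guesses above.)

    from datetime import timedelta

    # Toy trust metric: distrust revisions from accounts that are less than a
    # week old or have fewer than 7 prior edits at the time of the revision.
    def revision_trusted(account_age, prior_edits,
                         min_age=timedelta(days=7), min_edits=7):
        return account_age >= min_age and prior_edits >= min_edits

    print(revision_trusted(timedelta(days=2), 30))    # False: account too new
    print(revision_trusted(timedelta(days=400), 3))   # False: too few prior edits
    print(revision_trusted(timedelta(days=400), 50))  # True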
Thank you so much for your response!
This comment is very interesting, and it points out that we are most likely using shades of orange that are too dark (this is customizable in PHP, by the way). We have 10 equally-spaced shades of orange, from the darkest (trust 0) to pure white (trust 9). According to the current coloring scheme, even text with trust 7, which has been revised etc., gets a visible (two steps down from pure white) shade of orange. It might be visually better to have both 8 and 9 as pure white, and to lighten the shades of levels 6 and 7.
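(A sketch of such a trust-to-colour mapping, including the "treat 8 and 9 as pure white" tweak; the RGB values are made up for illustration -- the real palette is configurable.)

    # Map a trust level 0..9 to a background colour: darkest orange at 0,
    # pure white at 9, interpolating linearly in between. Setting white_from=8
    # makes both of the top two levels render as pure white.
    def trust_to_rgb(trust, white_from=8):
        if trust >= white_from:
            return (255, 255, 255)
        frac = trust / 9.0                 # 0.0 = untrusted, 1.0 = fully trusted
        darkest_orange = (255, 190, 110)   # illustrative, not the real palette
        return tuple(round(c + (255 - c) * frac) for c in darkest_orange)

    print(trust_to_rgb(0))  # darkest orange
    print(trust_to_rgb(7))  # faint orange
    print(trust_to_rgb(9))  # pure white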
In the new codebase, we have increased the speed at which text gains trust (we noticed also it was a bit too orange in that old demo). The vote button also helps text gain trust much more quickly, as people can just click there to validate the text, rather than having to do an edit. People can only raise the trust of text up to their own reputation level (which also goes from 0 to 9), so that spammers cannot enter an edit, then use sock-puppets to make the orange coloring disappear.
Yes, we did compare the results with a trust system based purely on text age; see Figure 7 of http://www.soe.ucsc.edu/~luca/papers/08/wikisym08-trust.html On the Wikipedia, people are so dedicated that most pages are visited regularly, so the age of text is a good indicator of text quality. Using a reputation system as we do enables us to assign medium trust (trust level 5) to the brand-new contributions of high-reputation authors (which means, in practice, anyone with a moderate history of good edits). If you go simply by text age, then you would end up assigning low trust to these contributions when they are brand new. Thus, the use of a reputation system enables us to assign _more_ trust to text. This is why Figure 7 shows that the trust based on a reputation system is more precise: low trust is assigned more sparingly, and it is a more precise predictor of future deletions.
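(A toy contrast between the two policies for brand-new text, purely to illustrate the point; the cap at level 5 follows the description above, and the function names are mine.)

    # Initial trust assigned to brand-new text under the two policies.
    def initial_trust_age_based(author_reputation):
        return 0  # new text is always untrusted until it survives revisions / ages

    def initial_trust_reputation_based(author_reputation):
        # Brand-new text by a high-reputation author starts at medium trust,
        # capped at level 5 here as in the description above.
        return min(author_reputation, 5)

    print(initial_trust_age_based(9), initial_trust_reputation_based(9))  # 0 vs 5
    print(initial_trust_age_based(0), initial_trust_reputation_based(0))  # 0 vs 0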
In summary, on the Wikipedia the use of a reputation system enables us to:
- Assign more trust to new text by good-reputation authors.
- Make it hard for spammers to cause their contributions to become fully trusted.
As Gregory points out, though, at least before the introduction of the vote button, our coloring had too much text in the lighter shades of orange. Whether this would remain a problem even after people can vote for the correctness of text, I don't know. We believe the vote button is very useful in enabling good text to gain trust quickly.
Also note that in many wikis that are less closely followed than Wikipedia, many pages remain unchecked for relatively long periods, so using text age there would not necessarily work well -- but I don't have data on hand to back this claim.
Luca