[Wikipedia-l] Study on Interfaces to Improving Wikipedia Quality

Gregory Maxwell gmaxwell at gmail.com
Sun Nov 23 14:44:40 UTC 2008


On Sun, Nov 23, 2008 at 9:03 AM, Maury Markowitz
<maury.markowitz at gmail.com> wrote:
> On Wed, Nov 19, 2008 at 2:23 PM,  <avani at cs.umn.edu> wrote:
>> We hope to get your valuable feedback on these interfaces and how Wikipedia
>> article quality can be improved.
>
> Given the older snapshots, I selected older articles that I had
> started, NuBUS and ARCNET.
>
> The "time based" system from UMN did not work at all, every search
> resulted in a page not found.

The UMN system intentionally included only a small number (70?) of
articles. This is why you needed to use the random-page function to
browse among them.

This doesn't reflect any shortcoming of the system; most likely it
just reflects the limited computational resources they were working
with.

[snip]
> Newer versions of the same articles had much more white, even though
> huge portions of the text were still from the original. This may be
> due to diff problems -- I consider diff to be largely random in
> effectiveness: sometimes it works, but other times a single
> whitespace change, especially vertical, will make it think the
> entire article was edited.

Yes, I had exactly the same experience with the UCSC system: different
coloring for text I'd added in the same edit that created the article.
Quite inscrutable.

[snip]
> Another problem I see with it is that it will rank an author whose
> contributions are 1000 unchanged comma inserts as being as reliable
> as an author who created a perfect 1000-character article (or
> perhaps rate the first even higher). There should be some sort of
> length bias; if an author makes a big edit, out of character, that's
> important to know.

For the articles it covered I found the UMN system more usable: its
output was more explicable, and the signal-to-noise ratio was simply
better. This may be partially due to bugs in the UCSC history
analysis, and partially due to a different choice of coloring
thresholds (UCSC seemed to color almost everything, removing the
usefulness of color as something to draw my attention).

Even so, I'm distrustful of "reputation" as an automated metric.
Reputation is a fuzzy thing (consider your comma example), while time
is a straightforward metric that is much easier to get right. Your
tireless and unreverted editing of external links tells me very little
about your ability to make a reliable edit to the intro of an article
... or at least very little that I didn't already know merely by
knowing whether your account was brand new. (New accounts are more
likely to be used by inexperienced and ill-motivated people.)

I believe a simple metric applied correctly, consistently, and
understandably is going to be more useful than a metric which
considers more data but is also subject to more noise. The
differential performance between these two systems has done nothing
but confirm my suspicion in this regard.
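
To make that concrete, here is a minimal sketch (Python; the token
timestamps, thresholds, and bucket names are all invented for
illustration) of the kind of purely time-based coloring I have in
mind: color each token by how long it has survived in the article,
nothing more.

    from datetime import datetime, timedelta

    def survival_color(token_added,
                       now=None,
                       thresholds=(timedelta(hours=1),
                                   timedelta(days=1),
                                   timedelta(days=30))):
        # Map the age of a token (the timestamp of the revision that
        # introduced it) to a coloring bucket. Newer text gets louder
        # coloring; text that has survived a month of editing is left
        # uncolored. The threshold values are illustrative, not tuned.
        age = (now or datetime.utcnow()) - token_added
        if age < thresholds[0]:
            return "bright"   # just added: highest scrutiny
        if age < thresholds[1]:
            return "medium"
        if age < thresholds[2]:
            return "faint"
        return "none"         # survived long enough to leave alone

Every input there is directly observable from the history; there is no
model of the author to get wrong.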

A simple, objective challenge for any predictive coloring system
would be to run it through the following experimental procedure:

* Take a dump of Wikipedia about a year old; use this as the
underlying knowledge for the systems.
* Make several random selections of articles, and gather the newer
revisions (not included in the initial dump) up to six months old.
Call these the test sets.
* The predictive coloring system should then take each revision in a
test set in time order and predict whether it will be reverted
(within some time window X?).
* The actual edits up to now should be analyzed to determine which
changes really were reverted, and when.

The final score would be the false positive and false negative rates.
So long as we assume that existing editing practices are not too bad,
we should find that the best predictive coloring system would
generally tend to minimize these rates.
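
Concretely, the scoring loop might look something like this (a Python
sketch; the predictor interface, the revision records, and the
was_reverted flag are hypothetical names for what the harness would
need to provide):

    # Hypothetical evaluation harness: `predictor` is any coloring
    # system exposing predict_revert() and observe(); `test_set` is a
    # list of revisions in time order, each carrying a was_reverted
    # flag computed from the real, later history.
    def score(predictor, test_set):
        fp = fn = tp = tn = 0
        for rev in test_set:
            predicted = predictor.predict_revert(rev)  # bool guess
            actual = rev.was_reverted                  # ground truth
            if predicted and not actual:
                fp += 1
            elif actual and not predicted:
                fn += 1
            elif predicted:  # and actual
                tp += 1
            else:
                tn += 1
            predictor.observe(rev)  # let it learn, in time order
        false_positive_rate = fp / max(fp + tn, 1)
        false_negative_rate = fn / max(fn + tp, 1)
        return false_positive_rate, false_negative_rate

A good system should drive both rates down together; the max(..., 1)
guards just avoid dividing by zero on degenerate test sets.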

