On Sun, Nov 23, 2008 at 9:03 AM, Maury Markowitz wrote:
On Wed, Nov 19, 2008 at 2:23 PM,
We hope to get your valuable feedback on these
interfaces and how Wikipedia
article quality can be improved.
Given the older snapshots, I selected older articles that I had
started, NuBus and ARCNET.
The "time based" system from UMN did not work at all: every search
resulted in a "page not found" error.
The UMN system intentionally included only a small number (70?) of
articles. This is why you needed to use the random-page function to
browse among them.
This doesn't reflect any shortcoming of the system; it most likely
just reflects the limits of the computational resources they had
available.
Newer versions of the same articles had much more white, even though
huge portions of the text were still from the original. This may be
due to diff problems: I consider diff to be largely random in
effectiveness. Sometimes it works, but other times a single whitespace
change, especially a vertical one, will make it think the entire
article was rewritten.
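A tiny illustration of the fragility I mean (my own sketch, using
Python's difflib rather than whatever diff either system actually
uses): reflowing a paragraph onto different line breaks changes every
line, so a line-granularity diff sees a total rewrite even though the
words are untouched.

```python
import difflib

# The same sentence, reflowed onto different line breaks.
old = ["The quick brown fox jumps", "over the lazy dog."]
new = ["The quick brown fox", "jumps over the lazy dog."]

# Line granularity: no line survives unchanged, so similarity is 0.
line_ratio = difflib.SequenceMatcher(None, old, new).ratio()

# Word granularity: the text is identical, so similarity is 1.
old_words = " ".join(old).split()
new_words = " ".join(new).split()
word_ratio = difflib.SequenceMatcher(None, old_words, new_words).ratio()

print(line_ratio, word_ratio)  # 0.0 vs 1.0
```

A history analyzer working at the wrong granularity would attribute
all of the "new" lines to the reflowing editor, which matches the
behavior I saw.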
Yes, I had exactly the same experience with the UCSC system: different
coloring for text I'd added in the same edit that created the article.
Another problem I see with it is that it will rate an author whose
contributions are 1000 unchanged comma insertions as being as reliable
as an author who created a perfect 1000-character article (or perhaps
rate the first even higher). There should be some sort of length bias;
if an author makes a big edit, out of character, that's important to
take into account.
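To make the length-bias idea concrete, here is a toy sketch (my own
hypothetical formula, not anything UCSC actually computes): weight
each edit by the amount of text it contributed, so reliability is
measured per surviving character rather than per surviving edit.

```python
def reputation(edits):
    """Toy length-weighted reputation: the fraction of contributed
    characters that survived (i.e. were not reverted).

    edits: list of (chars_added, survived) tuples.
    """
    total = sum(size for size, _ in edits)
    if total == 0:
        return 0.0
    surviving = sum(size for size, ok in edits if ok)
    return surviving / total

# 1000 surviving one-character comma fixes plus one reverted
# 1000-character rewrite:
history = [(1, True)] * 1000 + [(1000, False)]

# A per-edit survival rate would be ~0.999; length-weighted it is
# 0.5, so the big out-of-character edit actually moves the score.
print(reputation(history))  # 0.5
```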
For the articles it covered I found the UMN system to be more usable:
its output was more explicable, and the signal-to-noise ratio was
just better. This may be partially due to bugs in the UCSC history
analysis, and a different choice of coloring thresholds (UCSC seemed
to color almost everything, removing the usefulness of color as
something to draw my attention).
Even so, I'm distrustful of "reputation" as an automated metric.
Reputation is a fuzzy thing (consider your comma example), but time is
just a straightforward metric which is much easier to get right. Your
tireless and unreverted editing of external links tells me very little
about your ability to make a reliable edit to the intro of an article,
... or at least very little that I didn't already know by merely
knowing if your account was brand new or not. (New accounts are more
likely to be used by inexperienced and ill-motivated persons)
I believe a metric applied correctly, consistently, and understandably
is just going to be more useful than a metric which considers more
data but is also subject to more noise. The differential performance
between these two systems has done nothing but confirm my suspicions
in this regard.
A simple, objective challenge for any predictive coloring system would
be to use it in the following experimental procedure:
* Take a dump of Wikipedia up to a year old; use this as the
underlying knowledge for the systems.
* Make several random selections of articles and include the newer
revisions not included in the initial set, up to 6 months old. Call
these the test sets.
* The predictive coloring system should then take each revision in a
test set in time order and predict whether it will be reverted (within
X).
* The actual edits up to now should be analyzed to determine which
changes actually were reverted and when.
The final score will be the false positive and false negative rates.
So long as we assume that the existing editing practices are not too
bad, we should find that the best predictive coloring system would
generally tend to minimize these rates.
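The scoring step could be sketched like this (my own framing; the
function name and data shapes are assumptions, not part of either
system):

```python
def score(predicted, actual):
    """Compare a system's predictions against the later history.

    predicted: dict mapping revision id -> bool, the system's guess
               that the revision will be reverted within the window.
    actual:    dict mapping revision id -> bool, whether it really
               was reverted.
    Returns (false_positive_rate, false_negative_rate).
    """
    positives = [r for r in actual if actual[r]]      # really reverted
    negatives = [r for r in actual if not actual[r]]  # really kept
    fp = sum(1 for r in negatives if predicted[r])
    fn = sum(1 for r in positives if not predicted[r])
    fp_rate = fp / len(negatives) if negatives else 0.0
    fn_rate = fn / len(positives) if positives else 0.0
    return fp_rate, fn_rate

# Two revisions really reverted (1, 2), two kept (3, 4); the system
# got one of each wrong:
guesses = {1: True, 2: False, 3: True, 4: False}
truth = {1: True, 2: True, 3: False, 4: False}
print(score(guesses, truth))  # (0.5, 0.5)
```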