On Sun, Nov 23, 2008 at 9:03 AM, Maury Markowitz maury.markowitz@gmail.com wrote:
On Wed, Nov 19, 2008 at 2:23 PM, avani@cs.umn.edu wrote:
We hope to get your valuable feedback on these interfaces and how Wikipedia article quality can be improved.
Given the older snapshots, I selected older articles that I had started: NuBus and ARCNET.
The "time based" system from UMN did not work at all: every search resulted in a "page not found" error.
The UMN system intentionally included only a small number (70?) of articles. This is why you needed to use the random-page function to browse among them.
This doesn't reflect any shortcoming of the system; most likely it just reflects the limited computational resources they were working with.
[snip]
Newer versions of the same articles had much more white, even though huge portions of the text were still from the original. This may be due to diff problems -- I consider diff to be largely random in effectiveness: sometimes it works, but other times a single whitespace change, especially a vertical one, will make it think the entire article was edited.
Yes, I had exactly the same experience with the UCSC system: different coloring for text I'd added in the same edit that created the article. Quite inscrutable.
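For what it's worth, the sort of diff-based attribution I assume both systems perform is roughly this (a minimal Python sketch with made-up revisions and author names, not either system's actual code):

import difflib

def attribute(old_tokens, old_authors, new_tokens, new_author):
    """Carry per-token authorship from one revision to the next.

    Tokens the diff can match keep their original author; everything
    else is credited to the editor of the new revision.
    """
    new_authors = [new_author] * len(new_tokens)
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            new_authors[block.b + k] = old_authors[block.a + k]
    return new_authors

# Hypothetical two-revision history. Tokenizing on words makes the
# attribution insensitive to whitespace-only changes.
rev1 = "ARCNET is a token passing network protocol".split()
rev2 = "ARCNET is an early token passing network protocol".split()
print(attribute(rev1, ["maury"] * len(rev1), rev2, "editor2"))

A system that diffs whole lines rather than words would attribute an entire reflowed paragraph to the latest editor, which would produce exactly the arbitrary-looking coloring you describe.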
[snip]
Another problem I see with it is that it will rank an author whose contributions are 1000 unchanged comma inserts as just as reliable as an author who created a perfect 1000-character article (or perhaps rate the first even higher). There should be some sort of length bias; if an author makes a big edit, out of character, that's important to know.
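For concreteness, the length bias being suggested could be as simple as weighting each surviving edit by its size (an illustrative sketch only; the names and weights are invented, and this is not what the UCSC system actually computes):

def reputation(edits):
    """Sum size-weighted credit over an author's edit history.

    Each edit contributes its character count if it survives and
    subtracts it if reverted, so a surviving 1000-character article
    is worth as much as 1000 surviving comma inserts -- not 1000
    times less, as a per-edit survival count would have it.
    """
    return sum(size if survived else -size for size, survived in edits)

comma_inserter = [(1, True)] * 1000   # 1000 unchanged one-character edits
article_author = [(1000, True)]       # one unreverted 1000-character article

print(reputation(comma_inserter))   # 1000
print(reputation(article_author))   # 1000 -- equal credit by volume

Keeping an author's typical edit size in the same structure would also let the interface flag a big, out-of-character edit for extra attention.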
For the articles it covered I found the UMN system to be more usable: its output was more explicable, and the signal-to-noise ratio was just better. This may be partially due to bugs in the UCSC history analysis, and partially due to a different choice of coloring thresholds (UCSC seemed to color almost everything, which removed the usefulness of color as something to draw my attention).
Even so, I'm distrustful of "reputation" as an automated metric. Reputation is a fuzzy thing (consider your comma example), but time is a straightforward metric which is much easier to get right. Your tireless and unreverted editing of external links tells me very little about your ability to make a reliable edit to the intro of an article ... or at least very little that I didn't already know merely by knowing whether your account was brand new or not. (New accounts are more likely to be used by inexperienced and ill-motivated people.)
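By contrast, the entirety of a time-based metric can fit in a few lines (a sketch under my own assumptions; the thresholds are arbitrary and not taken from the UMN system):

from datetime import datetime, timedelta

def shade(introduced, now):
    """Color a piece of text purely by how long it has survived
    unreverted. No reputation input at all: the only signal is time.
    The threshold values are arbitrary illustrative choices.
    """
    age = now - introduced
    if age >= timedelta(days=90):
        return "white"        # long vetted, draw no attention
    if age >= timedelta(days=14):
        return "pale orange"  # somewhat vetted
    return "orange"           # new and unchecked, look here first

now = datetime(2008, 11, 23)
print(shade(datetime(2008, 1, 5), now))    # white
print(shade(datetime(2008, 11, 20), now))  # orange

The only per-token state this needs is the timestamp of the revision that introduced the token, which the same diff bookkeeping shown above can carry along.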
I believe a metric applied correctly, consistently, and understandably is simply going to be more useful than one which considers more data but is also subject to more noise. The difference in performance between these two systems has done nothing but confirm my suspicions in this regard.
A simple, objective challenge for any predictive coloring system would be to run it through the following experimental procedure:
* Take a dump of Wikipedia up to a year old; use this as the underlying knowledge for the systems.
* Make several random selections of articles, and for each include the newer revisions not in the initial dump, up to 6 months old. Call these the test sets.
* The predictive coloring system should then take each revision in a test set, in time order, and predict whether it will be reverted (within X time?).
* The actual edits up to now should be analyzed to determine which changes actually were reverted, and when.
The final score would be the false positive and false negative rates. So long as we assume that existing editing practices are not too bad, we should find that the best predictive coloring system will generally tend to minimize these rates.
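In code, the scoring step amounts to nothing more than this (a sketch: predict_revert stands in for whichever coloring system is under test, and was_reverted is the ground truth recovered from the newer history):

def score(test_set, predict_revert, was_reverted):
    """Compute the false positive and false negative rates of a
    revert predictor over one test set of revisions in time order.
    """
    false_pos = false_neg = reverted = kept = 0
    for rev in test_set:
        predicted = predict_revert(rev)
        if was_reverted(rev):
            reverted += 1
            false_neg += not predicted   # missed a bad edit
        else:
            kept += 1
            false_pos += predicted       # flagged a good edit
    return (false_pos / max(kept, 1),       # false positive rate
            false_neg / max(reverted, 1))   # false negative rate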