On Thu, Aug 20, 2009 at 2:10 PM, Anthony <wikimail@inbox.org> wrote:
On Thu, Aug 20, 2009 at 1:55 PM, Nathan <nawrich@gmail.com> wrote:
My point (which might still be incorrect, of course) was that an analysis based on 30,000 randomly selected pages was more informative about the English Wikipedia than 100 articles about serving United States Senators.
Any automated method of finding vandalism is doomed to failure. I'd say its informativeness was precisely zero.
Greg's analysis, on the other hand, was informative, but it was targeted at a much different question than Robert's.
"if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?" The best way to answer that question would be with a manually processed random sample taken from a pre-chosen moment in time. As few as 1000 revisions would probably be sufficient, if I know anything about statistics, but I'll let someone more knowledgeable verify or refute that. The results will depend heavily on one's definition of "vandalism", though.
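For what it's worth, the back-of-envelope math behind "1000 would probably be sufficient" can be sketched with a normal-approximation confidence interval for a sampled proportion (a rough illustration, not a substitute for a proper power calculation; the 5% vandalism rate is an assumed example value):

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a
    sampled proportion p with sample size n (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# If, say, ~5% of sampled revisions turned out vandalized, n = 1000
# pins the estimate down to within about +/- 1.4 percentage points:
print(round(margin_of_error(0.05, 1000), 4))  # ~0.0135
# Even in the worst case (p = 0.5) the margin is only ~3.1 points:
print(round(margin_of_error(0.5, 1000), 4))   # ~0.031
```

So 1000 revisions does give usable precision, provided the classification of each revision is itself reliable.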
Only in dreadfully obvious cases can you look at a revision by itself and know it contains vandalism. If the goal is really to characterize whether any vandalism has persisted in an article from any time in the past, then one really needs to look at the full edit history to see what has been changed / removed over time.
Even at the level of randomly sampling 1000 revisions, doing a real evaluation of the full history is likely to be impractical for any manual process.
If however you restrict yourself to asking whether 1000 edits contributed vandalism, then you have a relatively manageable task, and one that is more closely analogous to the technical program I set up. If it helps one can think of what I did as trying to characterize reverts and detect the persistence of "new vandalism" rather than "vandalism" in general. And of course, only "new vandalism" could be fixed by an immediate rollback / revert anyway.
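As a rough illustration of what "characterizing reverts" looks like mechanically, here is a minimal edit-summary classifier. The pattern list is a hypothetical example, not the actual set used in my analysis, which also drew on admin rollback, the undo function, revert bots, and editing tools:

```python
import re

# Hypothetical summary patterns covering common revert phrasings
# ("rv", "rvv", "revert", "undid", "rollback", etc.)
REVERT_RE = re.compile(r"\b(rvv?|revert(ed|ing)?|undid|undo|rollback)\b",
                       re.IGNORECASE)

def looks_like_revert(summary):
    """Heuristically flag an edit summary as describing a revert."""
    return bool(REVERT_RE.search(summary or ""))

print(looks_like_revert("rvv - obvious vandalism"))        # True
print(looks_like_revert("Undid revision 123 by Example"))  # True
print(looks_like_revert("add sources and copyedit"))       # False
```

Any such heuristic will miss reverts with blank or unusual summaries, which is one reason the approach is biased toward short-lived, quickly noticed vandalism.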
Qualitatively I tend to think that vandalism that has persisted through many intervening revisions is in a rather different category than new vandalism. Since people rarely look at or are aware of an article's ancient past, such persistent vandalism is at that point little different from any other error in an article. It is something to be fixed, but you won't usually be able to recognize it as a malicious act.
On Thu, Aug 20, 2009 at 12:38 PM, Jimmy Wales <jwales@wikia-inc.com> wrote:
Is there a possibility of re-running the numbers to include traffic weightings?
definitely should be done
Does anyone have a nice comprehensive set of page traffic aggregated at say a month level? The raw data used by stats.grok.se, etc. is binned hourly which opens one up to issues of short-term fluctuations, but I'm not at all interested in downloading 35 GB of hourly files just to construct my own long-term averages.
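Aggregating those hourly files into monthly totals is mechanically simple once the data is local; a minimal sketch, assuming the whitespace-delimited "project page_title count bytes" line format of the raw hourly dumps (the file-naming glob is hypothetical):

```python
from collections import Counter
import glob

def monthly_counts(path_glob, project="en"):
    """Sum per-page view counts across a set of hourly pagecount files.

    Assumes each line is whitespace-delimited as
    'project page_title count bytes' (the raw hourly dump format).
    """
    totals = Counter()
    for path in sorted(glob.glob(path_glob)):
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split()
                if len(parts) == 4 and parts[0] == project:
                    totals[parts[1]] += int(parts[2])
    return totals
```

The real obstacle is not this loop but the 35 GB download; a pre-aggregated monthly dataset would sidestep that entirely.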
I would hypothesize from experience that if we adjust the "random page" selection to account for traffic (to get a better view of what people are actually seeing) we would see slightly different results.
I think we'd see drastically different results.
If I had to make a prediction, I'd expect one might see numerically higher rates of vandalism and shorter average durations, but otherwise qualitatively similar results given the same methodology. I agree though that it would be worth doing the experiment.
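If anyone wants to run that experiment, traffic-weighted selection is easy to bolt on once per-page view counts exist; a sketch, where view_counts is a hypothetical mapping of page title to view count and the 30,000 sample size mirrors the original run:

```python
import random

def traffic_weighted_sample(view_counts, k=30000, seed=0):
    """Draw page titles with probability proportional to traffic.

    view_counts: hypothetical dict of page title -> view count.
    Sampling is with replacement, a reasonable approximation when
    the page population is much larger than k.
    """
    rng = random.Random(seed)
    pages = list(view_counts)
    weights = [view_counts[p] for p in pages]
    return rng.choices(pages, weights=weights, k=k)
```

A sample drawn this way estimates what a random *reader* sees, rather than what a random *page* contains, which is exactly the distinction at issue.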
I think we would see a lot less (percentagewise) vandalism that persists for a really long time for precisely the reason you identified: most vandalism that lasts a long time, lasts a long time because it is on obscure pages that no one is visiting.
Agreed. On the other hand, I think we'd also see that pages with more traffic are more likely to be vandalized.
Of course, this assumes a valid methodology. Using "admin rollback, the undo function, the revert bots, various editing tools, and commonly used phrases like 'rv', 'rvv', etc." to find vandalism is heavily skewed toward vandalism that doesn't last very long (or at least doesn't last very many edits). It's basically useless.
Yes, as I acknowledged above, "new vandalism". My personal interest is also skewed in that direction. If you don't like it and don't find it useful, feel free to ignore me and/or do your own analysis. Vandalism that has persisted through many revisions is a qualitatively different critter than most new vandalism. It's usually hard to identify, even by a manual process, and is unlikely to be fixed except through the normal editorial process of review, fact-checking, and revision. When vandalism is "new" people are at least paying attention to it in particular, and all vandalism starts out that way. Perhaps it would be more useful if you think of this work as a characterization of revert statistics?
Anyway, I provided my data point and described what I did so others could judge it for themselves. Regardless of your opinion, it addressed an issue of interest to me, and I would hope others also find some useful insight in it.
-Robert Rohde