I am supposed to be taking a wiki-vacation to finish my PhD thesis and find a job for next year. However, this afternoon I decided to take a break and consider an interesting question recently suggested to me by someone else:
When one downloads a dump file, what percentage of the pages are actually in a vandalized state?
This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?
Understanding what fraction of Wikipedia is vandalized at any given instant is obviously of both practical and public relations interest. In addition it bears on the motivation for certain development projects like flagged revisions. So, I decided to generate a rough estimate.
For the purposes of making an estimate I used the main namespace of the English Wikipedia and adopted the following operational approximations: I considered that "vandalism" is that thing which gets reverted, and that "reverts" are those edits tagged with "revert, rv, undo, undid, etc." in the edit summary line. Obviously, not all vandalism is cleanly reverted, and not all reverts are cleanly tagged. In addition, some things flagged as reverts aren't really addressing what we would conventionally consider to be vandalism. Such caveats notwithstanding, I have had some reasonable success with using a revert heuristic in the past. With the right keywords one can easily catch the standardized comments created by admin rollback, the undo function, the revert bots, various editing tools, and commonly used phrases like "rv", "rvv", etc. It won't be perfect, but it is a quick way of getting an automated estimate. I would usually expect the answer I get in this way to be correct within an order of magnitude, and perhaps within a factor of a few, though it is still just a crude estimate.
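In case anyone wants to reproduce the heuristic, the matching step amounts to little more than the following sketch (the keyword pattern shown here is illustrative rather than my exact list):

import re

# Illustrative pattern only; my real list was longer and tuned by hand
# against the standardized rollback/undo/bot comments described above.
REVERT_RE = re.compile(
    r'\b(revert(ed|ing)?|rvv?|undid|undo|rollback|rolled back)\b',
    re.IGNORECASE)

def looks_like_revert(edit_summary):
    """Heuristically decide whether an edit summary describes a revert."""
    return bool(REVERT_RE.search(edit_summary or ''))

Anything whose summary matches gets counted as a revert, and whatever it undoes gets counted as "vandalism".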
I analyzed the edit history up to the mid-June dump for a sample of 29,999 main namespace pages (sampling from everything in main, including redirects). This included 1,333,829 edits, from which I identified 102,926 episodes of reverted "vandalism". As a further approximation, I assumed that whenever a revert occurred, it applied to the immediately preceding edit and any additional consecutive changes by the same editor (this is how admin rollback operates, but it is not necessarily true of tools like undo).
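The episode-finding step can be sketched as follows (again a simplification rather than my exact implementation; the revisions of a page are assumed to be sorted oldest first):

def vandalism_intervals(revisions, is_revert):
    """revisions: list of (timestamp, editor, summary) tuples, oldest first.
    is_revert: predicate on edit summaries, e.g. the keyword heuristic above.
    Returns (start, end) intervals during which the page is presumed to have
    been vandalized. Assumes each revert undoes the immediately preceding run
    of consecutive edits by a single editor (admin rollback behaves this way;
    undo does not always)."""
    intervals = []
    for i, (ts, editor, summary) in enumerate(revisions):
        if i == 0 or not is_revert(summary):
            continue
        # Walk back over the consecutive edits by the presumed vandal.
        j = i - 1
        vandal = revisions[j][1]
        while j > 0 and revisions[j - 1][1] == vandal:
            j -= 1
        intervals.append((revisions[j][0], ts))
    return intervals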
With those assumptions, I then used the timestamps on my identified intervals of vandalism to figure out how much time each page had spent in a vandalized state. Over the entire history of Wikipedia, this sample of pages was vandalized during 0.28% of its existence. Or, more relevantly, focusing on just this year vandalism was present 0.21% of the time, which suggests that one should expect 0.21% of mainspace pages in any recent enwiki dump will be in a vandalized state (i.e. 1 in 480).
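The accounting behind those percentages is then just interval lengths summed against total page lifetime, roughly:

from datetime import timedelta

def vandalized_fraction(pages, as_of):
    """pages: iterable of (creation_time, intervals), where intervals is the
    list of (start, end) vandalism periods for that page.
    Returns the fraction of total page-existence time spent vandalized.
    A simplified sketch: it ignores deletions and assumes the intervals for a
    page do not overlap."""
    vandalized = timedelta(0)
    existence = timedelta(0)
    for creation, intervals in pages:
        existence += as_of - creation
        for start, end in intervals:
            vandalized += end - start
    return vandalized.total_seconds() / existence.total_seconds()

Restricting both sums to a recent window is, in essence, how the recent-period figure was formed.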
(Note that since redirects represent 55% of the main namespace and are rarely vandalized, one could argue that 0.37% [1 in 270] would be a better estimate for the portion of actual articles that are in a vandalized condition at any given moment.)
I also took a look at the time distribution of vandalism. Not surprisingly, it has a very long tail. The median time to revert over the entire history is 6.7 minutes, but the mean time to revert is 18.2 hours, and my sample included one revert going back 45 months (though such a long lag also implies the page had gone years without any edits, suggesting an obscure topic that was almost never visited). In the recent period these figures become 5.2 minutes and 14.4 hours for the median and mean respectively. The observation that nearly 50% of reverts occur in 5 minutes or less is a testament to the efficient work of recent changes reviewers and watchlists.
Unfortunately the 5% of vandalism that persists longer than 35 hours is responsible for 90% of the actual vandalism a visitor is likely to encounter at random. Hence, as one might guess, it is the vandalism that slips through and persists the longest that has the largest practical effect.
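For anyone checking the arithmetic, the 90% figure is nothing deeper than the share of total vandalized time (and hence of uniform-random exposure) contributed by the long-lived episodes, i.e. something like:

def tail_exposure_share(durations_hours, threshold_hours=35.0):
    """Share of total vandalized time contributed by episodes lasting longer
    than the threshold. Under a uniform-readership assumption this is also
    the share of randomly encountered vandalism they account for."""
    total = sum(durations_hours)
    tail = sum(d for d in durations_hours if d > threshold_hours)
    return tail / total if total else 0.0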
It is also worth noting that the prevalence figures for February-May of this year are slightly lower than at any time since 2006. There is also a drop in the mean duration of vandalism coupled with a slight increase in the median duration. However, these effects mostly disappear if we limit our consideration to vandalism events lasting 1 month or shorter. Hence those changes may in significant part be linked to cut-off biasing from longer-term vandalism events that have yet to be identified. The ambiguity in the change from earlier in the year is somewhat surprising, as the AbuseFilter was launched in March and was intended to decrease the burden of vandalism. One might speculate that the simple vandalism amenable to the AbuseFilter was already being addressed quickly in nearly all cases, and hence its impact on the persistence of vandalism may already have been fairly limited.
I've posted some summary data on the wiki at:
http://en.wikipedia.org/wiki/Wikipedia:Vandalism_statistics
Given the nature of the approximations I made in doing this analysis I suspect it is more likely that I have somewhat underestimated the vandalism problem rather than overestimated it, but as I said in the beginning I'd like to believe I am in the right ballpark. If that's true, I personally think that having less than 0.5% of Wikipedia be vandalized at any given instant is actually rather comforting. It's not a perfect number, but it would suggest that nearly everyone still gets to see Wikipedia as intended rather than in a vandalized state. (Though to be fair I didn't try to figure out if the vandalism occurred in more frequently visited parts or not.)
Unfortunately, that's it for now as I need to get back to my thesis / job search.
-Robert Rohde
Robert, thanks for this. I have long wanted that number: it is really interesting.
-----Original Message----- From: Robert Rohde <rarohde@gmail.com>
Date: Thu, 20 Aug 2009 03:06:06 To: Wikimedia Foundation Mailing List <foundation-l@lists.wikimedia.org>; English Wikipedia <wikien-l@lists.wikimedia.org> Cc: Sean Moss-Pultz <sean@openmoko.com>; suh@parc.com Subject: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 6:06 AM, Robert Rohde <rarohde@gmail.com> wrote: [snip]
When one downloads a dump file, what percentage of the pages are actually in a vandalized state?
Although, as it happens, you don't actually answer that question; you answer a different one:
[snip]
approximations: I considered that "vandalism" is that thing which gets reverted, and that "reverts" are those edits tagged with "revert, rv, undo, undid, etc." in the edit summary line. Obviously, not all vandalism is cleanly reverted, and not all reverts are cleanly tagged.
Which is interesting too, but part of the problem with calling this a measure of vandalism is that it isn't really one, and we don't have a good handle on how solid an approximation it is beyond gut feelings and arm-waving.
The study of Wikipedia activity is a new area of research, not something that has been studied for decades. Not only do we not know many things about Wikipedia, but we don't know many things about how to know things about Wikipedia.
There must be ways to get a better understanding, but we may not know of them, and the ones we do know of are not always used. For example, we could increase our confidence in this type of proxy-measure by taking a random subset of that data and having humans classify it based on some agreed-upon criteria. By performing the review process many times we could get a handle on the typical error of both the proxy-metric and the meta-review.
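Concretely, that review loop doesn't need to be anything fancier than the following sketch (the sample size and the simple binomial error bar are placeholders, not recommendations):

import math
import random

def sample_for_review(heuristic_hits, n=200, seed=0):
    """Draw a random subset of the heuristic's hits for human labeling."""
    rng = random.Random(seed)
    return rng.sample(heuristic_hits, min(n, len(heuristic_hits)))

def precision_estimate(labels):
    """labels: booleans, True where a human judged the hit to be genuine
    vandalism. Returns (estimated precision, 95% margin of error)."""
    n = len(labels)
    p = sum(labels) / n
    return p, 1.96 * math.sqrt(p * (1 - p) / n)

Doing the same thing on a sample of edits the heuristic *missed* gives a handle on recall, and repeating the labeling with several reviewers gives a handle on the error of the review itself.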
The risk here is that people will mistake these shorthand metrics for the real thing, and the risk is increased when we encourage that by using language which suggests that the simplistic understanding is the correct one. IMO, highly uncertain and/or outright wrong information is worse than no information at all when you aren't aware of how reliable it is.
We can't control how the press chooses to report on research, but when we actively encourage misunderstandings by playing up the significance or generality of our research, our behaviour is unethical. Vigilance is required.
This risk of misinformation is increased many-fold in comparative analysis, where factors like time are plotted against indicators, because we often miss confounding variables (http://en.wikipedia.org/wiki/Confounding).
Stepping away from your review for a moment, because it wasn't primarily a comparative one, I'd like to make some general points:
For example, if research finds that edits are reverted more frequently over time, is this because there has been a change in the revision decision process, or have articles become better and more complete over time, with edits to long, high-quality articles having always been more likely to be reverted? Both are probably true, but how does the contribution break down?
There are many other possibly significant confounding variables. Probably many more than any of us have thought of yet.
I've always been of the school of thought that we do research to produce understanding, not just to generate numbers. "Wikipedia becomes more complete over time, leaving less work for new people to do" and "Wikipedia is increasingly hostile towards new contributors" are pretty different understandings, but both may be supported by the same data, at least until you've controlled for many factors.
Another example— because of the scale of Wikipedia we must resort to proxy-metrics. We can't directly measure vandalism, but we can measure how often someone adds "is gay" over time. Proxy-metrics are powerful tools but can be misleading. If we're trying to automatically identify vandalism for a study (either to include it or exclude it) we have the risk that the vandals are adapting to automatic identification: If you were using "is gay" as a measure of vandalism over time you might conclude that vandalism is decreasing when in reality "cluebot" is performing the same kind of analysis for its automatic vandalism suppression and the vandals have responded by vandalizing in forms that can't be automatically identified, such as by changing dates to incorrect values.
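To make that proxy concrete, it is nothing more sophisticated than something like this, with every blind spot just described baked in:

from collections import Counter

PHRASE = 'is gay'

def phrase_additions_by_year(revisions):
    """revisions: iterable of (timestamp, old_text, new_text).
    Counts, per year, the edits that introduce the phrase. A deliberately
    crude proxy: it says nothing about vandalism that takes other forms."""
    counts = Counter()
    for ts, old_text, new_text in revisions:
        if PHRASE in new_text.lower() and PHRASE not in old_text.lower():
            counts[ts.year] += 1
    return counts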
Or, keeping the goal of understanding in mind, sometimes the measurements can all be right but a lack of care and consideration can still cause people to draw the wrong conclusions. For example, English Wikipedia has adopted a much stronger policy about citations in articles about living people than it once had. It is *intentionally* more difficult to contribute to those articles than it once was, especially for new contributors who do not know the rules.
Going back to your simple study now: The analysis of vandalism duration and its impact on readers makes an assumption about readership which we know to be invalid. You're assuming a uniform distribution of readership: That readers are just as likely to read any random article. But we know that the actual readership follows a power-law (long-tail) distribution. Because of the failure to consider traffic levels we can't draw conclusions on how much vandalism readers are actually exposed to.
Interestingly— you've found a power-law distribution in vandalism lifetime. Is it possible that readership and vandalism life are correlated, that more widely read articles tend to get reverted faster? That doesn't sound unreasonable to me and if it's true it means that readers are exposed to far less vandalism than a uniform model would suggest.
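If someone wanted to test that, the traffic-weighted version of the headline number is a small change once per-page view rates are in hand (getting those rates is the hard part; the field names here are hypothetical):

def vandalized_view_fraction(pages):
    """pages: iterable of (views_per_hour, total_hours, vandalized_hours).
    Fraction of page views that land on a vandalized revision once traffic
    is accounted for; compare with the unweighted
    sum(vandalized_hours) / sum(total_hours) figure."""
    hit = sum(rate * bad for rate, total, bad in pages)
    served = sum(rate * total for rate, total, bad in pages)
    return hit / served if served else 0.0

If heavily read pages really are reverted faster, this number will come out well below the unweighted one.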
In any case— I don't say any of this to criticize the mechanics of your work. I'm able to point these things out because you were clear about what you measured, more so than some other analysis has been (including my own, at times). But I do think it's important that we are careful not to describe our work in ways that will cause laymen to over-generalize, and that we keep in mind that most readers are not researchers and that they desperately want the kind of pat, open-and-shut answers we won't be able to even begin providing until the study of Wikipedia is far more mature.
Likewise, users of Wikipedia research should be forewarned that researchers are apt to use simple words like "vandalism" when they are really measuring something far more specific and that surprising correlations between what is actually being measured and the things it is being measured against may produce misleading conclusions.
Cheers!
Gregory Maxwell wrote:
If you were using "is gay" as a measure of vandalism over time you might conclude that vandalism is decreasing when in reality "cluebot" is performing the same kind of analysis for its automatic vandalism suppression and the vandals have responded by vandalizing in forms that can't be automatically identified, such as by changing dates to incorrect values.
And if that's true, that's on net a bad thing. Most "is gay" vandalism (not all) is just stupid and embarrassing, and it will be obvious to the reader as vandalism; lots of people get how Wikipedia works and are reasonably tolerant of seeing that sort of thing from time to time.
But people expect that we should get the dates right, and they are right to ask that of us.
I understand that you're just making up a hypothetical, not saying that this is what is actually happening. I'm just agreeing with this line of thinking that says, in essence, "when we think about measuring vandalism, which is already hard enough, we also have to think about how damaging different kinds of vandalism actually are".
Greg, I think your email sounded a little negative at the start, but not so much further down. I think you would join me heartily in being super grateful for people doing this kind of analysis. Yes, some of it will be primitive and will suffer from the many difficulties. But data-driven decisionmaking is a great thing, particularly when we are cognizant of the limitations of the data we're using.
I just didn't want anyone to get the idea (and I'm sure I'm reading you right) that you were opposed to people doing research. :-)
--Jimbo
On Thu, Aug 20, 2009 at 12:46 PM, Jimmy Wales <jwales@wikia-inc.com> wrote: [snip]
Greg, I think your email sounded a little negative at the start, but not so much further down. I think you would join me heartily in being super grateful for people doing this kind of analysis. Yes, some of it will be primitive and will suffer from the many difficulties. But data-driven decisionmaking is a great thing, particularly when we are cognizant of the limitations of the data we're using.
I just didn't want anyone to get the idea (and I'm sure I'm reading you right) that you were opposed to people doing research. :-)
Absolutely— No one who has done this kind of analysis could fail to appreciate the enormous amount of work that goes into even making a couple of simple, seemingly "off the cuff" numbers out of the mountain of data that is Wikipedia.
Making sure the numbers are accurate and meaningful while also clearly explaining the process of generating them is in and of itself a large amount of work, and my gratitude is extended to anyone who contributes to those processes.
I've long been a loud proponent of data driven decision making. So I'm absolutely not opposed to people doing research, but just as you said— we need to be acutely aware of the limitations of the research. Weak data is clearly better than no data, but only when you are aware of the strength of the data. Or, in other words, knowing what you don't know is often *the most critical* piece of information in any decision making process.
In our eagerness to establish what we can and do know it can be easy to forget how much we don't know. Some of the limitations which are all too obvious to researchers are less than obvious to people who've never personally done quantitative analysis on Wikipedia data, yet many of these people are the decision makers that must do something useful with the data. The casual language used when researchers write for researchers can magnify misunderstandings. It was merely my intent to caution against the related risks.
I think the most impactful contributions available to researchers today lie less in the direct research itself and more in advancing the art of researching Wikipedia. But the two go hand in hand; we can't advance the art if we don't do the research. The latter type of work is less sexy and not prone to generating headlines, but it is work that will last and generate citations for a long time. Measurements of X today will soon be forgotten as they are replaced by later analysis of the historical data using superior techniques.
That my tone was somewhat negative is only due to my extreme disappointment that our own discussion of recent measurements has been almost entirely devoid of critical analysis. Contributors patting themselves on the back and saying "I told you so!" seem to be outnumbering suggestions that the research might mean something else entirely, though perhaps that is my own bias speaking. To the extent that I'm wrong about that, I hope that my comments were merely redundant; to the extent that I'm right, I hope my points will invite nuanced understanding of the research and encourage people to seek out and expose potentially confounding variables and bad proxies so that all our knowledge can be advanced.
If this stuff were easy it would all be done already. Wikipedia research is interesting because it is both hard and potentially meaningful. There is room and need for contributions from everyone.
Cheers!
2009/8/20 Gregory Maxwell gmaxwell@gmail.com:
Going back to your simple study now: The analysis of vandalism duration and its impact on readers makes an assumption about readership which we know to be invalid. You're assuming a uniform distribution of readership: That readers are just as likely to read any random article. But we know that the actual readership follows a power-law (long-tail) distribution. Because of the failure to consider traffic levels we can't draw conclusions on how much vandalism readers are actually exposed to.
We're also assuming a uniform distribution of vandalism, as it were. There are a number of different types of vandalism: obscene defacement, malicious alteration of factual content, meaningless test edits of a character or two, schoolkids leaving messages for each other...
...and it all has a different impact on the reader.
This has two implications:
a) It seems safe to assume that replacing the entire article with "john is gay" is going to get spotted and reverted faster, on average, than an edit providing a plausible-sounding but entirely fictional history for a small town in Kansas. So, any changes in the pattern of the *content* of vandalism are going to lead to changes in the duration and thus the overall frequency of it, even if the number of vandal edits is constant.
b) We can easily compare the difference in effect for vandalism left on differently trafficked pages for various times - roughly speaking, time * traffic = number of readers affected. If some vandalism is worse than others, we could thus also calculate some kind of intensity metric - one hundred people viewing enormous genital piercing images on [[Kitten]] is probably worse than ten thousand people viewing "asdfdfggfh" at the end of a paragraph in the same article.
I'm not sure how we'd go ahead with the second one, but it's an interesting thing to think about.
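A very rough rendering of that intensity idea, with an entirely made-up 0-to-1 severity weight, might look like:

def vandalism_impact(episodes):
    """episodes: iterable of (duration_hours, views_per_hour, severity).
    severity is a subjective weight between 0 and 1 (genital piercing images
    near 1, a stray "asdfdfggfh" near 0). Returns a crude total of readers
    affected, scaled by how bad what they saw was."""
    return sum(duration * rate * severity
               for duration, rate, severity in episodes)

The hard part, of course, is agreeing on the severity weights in the first place.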
On Thu, Aug 20, 2009 at 12:06 PM, Robert Rohde <rarohde@gmail.com> wrote:
Given the nature of the approximations I made in doing this analysis I suspect it is more likely that I have somewhat underestimated the vandalism problem rather than overestimated it, but as I said in the beginning I'd like to believe I am in the right ballpark. If that's true, I personally think that having less than 0.5% of Wikipedia be vandalized at any given instant is actually rather comforting. It's not a perfect number, but it would suggest that nearly everyone still gets to see Wikipedia as intended rather than in a vandalized state. (Though to be fair I didn't try to figure out if the vandalism occurred in more frequently visited parts or not.)
Thanks for the excellent analysis, Robert. Just to give an idea of what 0.4% means in practice, you can think of it in terms of one country, 12 US counties, 33 Italian municipalities, 147 French municipalities, or 1 Pope.
Cruccone
Robert Rohde wrote:
When one downloads a dump file, what percentage of the pages are actually in a vandalized state?
This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?
Is there a possibility of re-running the numbers to include traffic weightings?
I would hypothesize from experience that if we adjust the "random page" selection to account for traffic (to get a better view of what people are actually seeing) we would see slightly different results.
I think we would see a lot less (percentagewise) vandalism that persists for a really long time for precisely the reason you identified: most vandalism that lasts a long time, lasts a long time because it is on obscure pages that no one is visiting. That doesn't mean it is not a problem, but it does change some thinking about what kinds of tools are needed to deal with that problem.
I'm not sure what else would change.
2009/8/20 Jimmy Wales jwales@wikia-inc.com:
Robert Rohde wrote:
When one downloads a dump file, what percentage of the pages are actually in a vandalized state?
This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?
Is there a possibility of re-running the numbers to include traffic weightings?
I'd like to see that data too. I'm sure you are right that vandalism doesn't last as long on popular pages, but it would be very interesting to see how much quicker it is reverted and how popular a page needs to be for that to apply (or whether it is a gradual improvement).
2009/8/20 Robert Rohde rarohde@gmail.com:
I am supposed to be taking a wiki-vacation to finish my PhD thesis and find a job for next year. However, this afternoon I decided to take a break and consider an interesting question recently suggested to me by someone else: [snip]
That's an interesting bit of research, but, as you say, it is very crude. This study seems to have a better methodology, although it has a much smaller sample:
http://en.wikipedia.org/wiki/User:Aetheling/Vandalism_survival
If we could do that survey again with a large sample, it would be very interesting indeed.