While the time and effort that went into Robert Rohde's analysis is certainly extensive, the outcomes are based on so many flawed assumptions about the nature of vandalism and vandalism reversion that one publicizes the key "finding" of a 0.4% vandalism rate at one's peril.
http://en.wikipedia.org/w/index.php?title=John_McCain&diff=169808394&... 11 hours. Reverted with no tags.
http://en.wikipedia.org/w/index.php?title=Maria_Cantwell&diff=prev&o... 46 days. Reverted with note: "Undid revision 160400298 by 75.133.82.218" By the way, there was a two-minute vandalism incident in the interim, so in many cases, just because an analyst finds a "recent and short" incident, he or she may be completely missing a longer-term incident.
http://en.wikipedia.org/w/index.php?title=Ted_Stevens&diff=prev&oldi... There goes your "rvv" theory. In this case, "rvv" was a flag for even more preposterous vandalism.
The notion that these are lightly watched or lightly edited articles is a bit difficult to swallow, since they are the biographical articles about three United States senators. These articles were analyzed by an independent team of volunteers, and we found that the 100 senatorial articles were in deliberate disrepair about 6.8% of the time, a figure that differs vastly from Rohde's. Certainly, one could argue that articles about political figures may be vandalized more often, but one might also counter that argument with the assumption that "more eyes" ought to be watching these articles and repairing them. More detail here:
http://www.mywikibiz.com/Wikipedia_Vandalism_Study
Admittedly, there were some minor flaws with our study's methodology, too. These are reviewed on the Discussion page. But, as with Rohde's assessment, if anything, we may have understated the problem at 6.8%.
I remain unimpressed with Wikipedia's accuracy rate, and I am bewildered why "flagged revisions" have not been implemented yet.
Greg
On Thu, Aug 20, 2009 at 12:59 PM, Gregory Kohs thekohser@gmail.com wrote:
While the time and effort that went into Robert Rohde's analysis is certainly extensive, the outcomes are based on so many flawed assumptions about the nature of vandalism and vandalism reversion that one publicizes the key "finding" of a 0.4% vandalism rate at one's peril.
http://en.wikipedia.org/w/index.php?title=John_McCain&diff=169808394&... 11 hours. Reverted with no tags.
The best part about that little exchange is: http://en.wikipedia.org/w/index.php?title=John_McCain&diff=next&oldi...
wherein a revert was made returning the vandalism, followed by another when the editor noticed his error.
I don't think Robert made any firm conclusions on the meaning of his data; he notes all the caveats that others have since emphasized, and admits to likely underestimating vandalism. I read the 0.4% as representing the approximate number of articles containing vandalism in an English Wikipedia snapshot; that is quite different from the amount of time specific articles stay in a "vandalized" state. Given the difficulty of accurately analyzing this sort of data, no firm conclusions can be drawn; but certainly its more informative than a Wikipedia Review analysis of a relatively small group of articles in a specific topic area.
Nathan
Nathan said:
"...but certainly its (sic) more informative than a Wikipedia Review analysis of a relatively small group of articles in a specific topic area."
And you are certainly entitled to a flawed opinion based on incorrect assumptions, such as ours being a "Wikipedia Review" analysis. But, nice try at a red herring argument.
Greg
On Thu, Aug 20, 2009 at 1:30 PM, Gregory Kohs thekohser@gmail.com wrote:
Nathan said:
"...but certainly its (sic) more informative than a Wikipedia Review analysis of a relatively small group of articles in a specific topic area."
And you are certainly entitled to a flawed opinion based on incorrect assumptions, such as ours being a "Wikipedia Review" analysis. But, nice try at a red herring argument.
Greg
Well, you can understand where I would get that idea - since the URL you provided had "Wikipedia Review members" performing the research, until you changed it a few minutes ago.
http://www.mywikibiz.com/index.php?title=Wikipedia_Vandalism_Study&diff=...
My point (which might still be incorrect, of course) was that an analysis based on 30,000 randomly selected pages was more informative about the English Wikipedia than 100 articles about serving United States Senators.
Nathan
On Thu, Aug 20, 2009 at 1:55 PM, Nathan nawrich@gmail.com wrote:
My point (which might still be incorrect, of course) was that an analysis based on 30,000 randomly selected pages was more informative about the English Wikipedia than 100 articles about serving United States Senators.
Any automated method of finding vandalism is doomed to failure. I'd say its informativeness was precisely zero.
Greg's analysis, on the other hand, was informative, but it was targeted at a much different question than Robert's.
"if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision" The best way to answer that question would be with a manually processed random sample taken from a pre-chosen moment in time. As few as 1000 revisions would probably be sufficient, if I know anything about statistics, but I'll let someone with more knowledge of statistics verify or refute that. The results will depend heavily on one's definition of "vandalism", though.
On Thu, Aug 20, 2009 at 12:38 PM, Jimmy Wales jwales@wikia-inc.com wrote:
Is there a possibility of re-running the numbers to include traffic weightings?
definitely should be done
I would hypothesize from experience that if we adjust the "random page" selection to account for traffic (to get a better view of what people are actually seeing) we would see slightly different results.
I think we'd see drastically different results.
I think we would see a lot less (percentagewise) vandalism that persists for a really long time for precisely the reason you identified: most vandalism that lasts a long time, lasts a long time because it is on obscure pages that no one is visiting.
Agreed. On the other hand, I think we'd also see that pages with more traffic are more likely to be vandalized.
Of course, this assumes a valid methodology. Using "admin rollback, the undo function, the revert bots, various editing tools, and commonly used phrases like "rv", "rvv", etc." to find vandalism is heavily skewed toward vandalism that doesn't last very long (or at least doesn't last very many edits). It's basically useless.
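For concreteness, the summary-matching part of that heuristic looks roughly like the sketch below; the pattern list is illustrative, not Robert's actual one. By construction it only finds vandalism that someone already noticed and reverted with a conventional summary, which is exactly the skew described above.

    import re

    # Illustrative patterns only; the actual tool's list may differ.
    REVERT_SUMMARY = re.compile(
        r"\b(rvv?|revert(ed|ing)?|rollback|undid revision \d+)\b",
        re.IGNORECASE)

    def looks_like_revert(summary):
        """Flag an edit as a probable vandalism revert from its
        edit summary alone; misses anything not yet reverted."""
        return bool(summary and REVERT_SUMMARY.search(summary))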
On Thu, Aug 20, 2009 at 2:10 PM, Anthony wikimail@inbox.org wrote:
On Thu, Aug 20, 2009 at 1:55 PM, Nathan nawrich@gmail.com wrote:
My point (which might still be incorrect, of course) was that an analysis based on 30,000 randomly selected pages was more informative about the English Wikipedia than 100 articles about serving United States Senators.
Any automated method of finding vandalism is doomed to failure. I'd say its informativeness was precisely zero.
Greg's analysis, on the other hand, was informative, but it was targeted at a much different question than Robert's.
"if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision" The best way to answer that question would be with a manually processed random sample taken from a pre-chosen moment in time. As few as 1000 revisions would probably be sufficient, if I know anything about statistics, but I'll let someone with more knowledge of statistics verify or refute that. The results will depend heavily on one's definition of "vandalism", though.
Only in dreadfully obvious cases can you look at a revision by itself and know it contains vandalism. If the goal is really to characterize whether any vandalism has persisted in an article from any time in the past, then one really needs to look at the full edit history to see what has been changed / removed over time.
Even at the level of randomly sampling 1000 revisions, doing a real evaluation of the full history is likely to be impractical for any manual process.
If however you restrict yourself to asking whether 1000 edits contributed vandalism, then you have a relatively manageable task, and one that is more closely analogous to the technical program I set up. If it helps one can think of what I did as trying to characterize reverts and detect the persistence of "new vandalism" rather than "vandalism" in general. And of course, only "new vandalism" could be fixed by an immediate rollback / revert anyway.
Qualitatively I tend to think that vandalism that has persisted through many intervening revisions is in a rather different category than new vandalism. Since people rarely look at or are aware of an article's ancient past, such persistent vandalism is at that point little different than any other error in an article. It is something to be fixed, but you won't usually be able to recognize it as a malicious act.
On Thu, Aug 20, 2009 at 12:38 PM, Jimmy Wales jwales@wikia-inc.com wrote:
Is there a possibility of re-running the numbers to include traffic weightings?
definitely should be done
Does anyone have a nice comprehensive set of page traffic aggregated at, say, a month level? The raw data used by stats.grok.se, etc. is binned hourly, which opens one up to issues of short-term fluctuations, but I'm not at all interested in downloading 35 GB of hourly files just to construct my own long-term averages.
I would hypothesize from experience that if we adjust the "random page" selection to account for traffic (to get a better view of what people are actually seeing) we would see slightly different results.
I think we'd see drastically different results.
If I had to make a prediction, I'd expect one might see numerically higher rates of vandalism and shorter average durations, but otherwise qualitatively similar results given the same methodology. I agree though that it would be worth doing the experiment.
I think we would see a lot less (percentagewise) vandalism that persists for a really long time for precisely the reason you identified: most vandalism that lasts a long time, lasts a long time because it is on obscure pages that no one is visiting.
Agreed. On the other hand, I think we'd also see that pages with more traffic are more likely to be vandalized.
Of course, this assumes a valid methodology. Using "admin rollback, the undo function, the revert bots, various editing tools, and commonly used phrases like "rv", "rvv", etc." to find vandalism is heavily skewed toward vandalism that doesn't last very long (or at least doesn't last very many edits). It's basically useless.
Yes, as I acknowledged above, "new vandalism". My personal interest is also skewed in that direction. If you don't like it and don't find it useful, feel free to ignore me and/or do your own analysis. Vandalism that has persisted through many revisions is a qualitatively different critter than most new vandalism. It's usually hard to identify, even by a manual process, and is unlikely to be fixed except through the normal editorial process of review, fact-checking, and revision. When vandalism is "new" people are at least paying attention to it in particular, and all vandalism starts out that way. Perhaps it would be more useful if you think of this work as a characterization of revert statistics?
Anyway, I provided my data point and described what I did so others could judge it for themselves. Regardless of your opinion, it addressed an issue of interest to me, and I would hope others also find some useful insight in it.
-Robert Rohde
Robert Rohde wrote:
Does anyone have a nice comprehensive set of page traffic aggregated at, say, a month level? The raw data used by stats.grok.se, etc. is binned hourly, which opens one up to issues of short-term fluctuations, but I'm not at all interested in downloading 35 GB of hourly files just to construct my own long-term averages.
I don't have every article, but I have the data for July 09 for ~600,000 pages on enwiki (mostly articles). It also has the hit counts for redirects aggregated with the article, not sure if that would be more or less useful for you. Let me know if you want it; it's in a MySQL table on the toolserver right now.
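For anyone who does want to roll their own monthly totals from the raw hourly dumps, a minimal sketch follows. It assumes the pagecounts-raw line format ("project title count bytes") and that one month's .gz files sit in a single directory; the paths are hypothetical.

    import glob
    import gzip
    from collections import Counter

    def monthly_totals(directory, project="en"):
        """Sum hourly pagecounts-raw files into per-title monthly totals.

        Each line looks like: "en Some_Title 42 123456"
        (project, URL-encoded title, hit count, bytes transferred).
        """
        totals = Counter()
        for path in sorted(glob.glob(directory + "/pagecounts-*.gz")):
            with gzip.open(path, "rt", errors="replace") as f:
                for line in f:
                    parts = line.split(" ")
                    if len(parts) >= 3 and parts[0] == project:
                        try:
                            totals[parts[1]] += int(parts[2])
                        except ValueError:
                            pass  # skip malformed lines
        return totals

    # Usage: totals = monthly_totals("/data/pagecounts/2009-07")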
On Thu, Aug 20, 2009 at 6:36 PM, Robert Rohde rarohde@gmail.com wrote:
On Thu, Aug 20, 2009 at 2:10 PM, Anthony wikimail@inbox.org wrote:
"if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision" The best way to answer
that
question would be with a manually processed random sample taken from a pre-chosen moment in time. As few as 1000 revisions would probably be sufficient, if I know anything about statistics, but I'll let someone
with
more knowledge of statistics verify or refute that. The results will
depend
heavily on one's definition of "vandalism", though.
Only in dreadfully obvious cases can you look at a revision by itself and know it contains vandalism. If the goal is really to characterize whether any vandalism has persisted in an article from any time in the past, then one really needs to look at the full edit history to see what has been changed / removed over time.
I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added.
Of course, this assumes a valid methodology. Using "admin rollback, the undo function, the revert bots, various editing tools, and commonly used phrases like "rv", "rvv", etc." to find vandalism is heavily skewed toward vandalism that doesn't last very long (or at least doesn't last very many edits). It's basically useless.
Yes, as I acknowledged above, "new vandalism".
"New vandalism" which has not yet been reverted wouldn't be included.
My personal interest is also skewed in that direction. If you don't like it and don't find it useful, feel free to ignore me and/or do your own analysis.
I do. I also feel free to criticize your methods publicly, since you decided to share them publicly.
Anyway, I provided my data point and described what I did so others could judge it for themselves. Regardless of your opinion, it addressed an issue of interest to me, and I would hope others also find some useful insight in it.
And I presented my criticism, which hopefully others will find some useful insight in as well.
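A minimal sketch of the sampling procedure Anthony proposes, against the standard MediaWiki API; the SNAPSHOT timestamp is an assumption, and the manual review of each sampled revision would follow separately.

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    SNAPSHOT = "2009-08-20T00:00:00Z"  # the pre-chosen moment in time

    def sample_revisions(n):
        """Yield (title, revid) for the revision of n random articles
        that was current as of SNAPSHOT; each then gets hand-reviewed."""
        session = requests.Session()
        for _ in range(n):
            rand = session.get(API, params={
                "action": "query", "list": "random",
                "rnnamespace": 0, "rnlimit": 1, "format": "json"}).json()
            title = rand["query"]["random"][0]["title"]
            revs = session.get(API, params={
                "action": "query", "prop": "revisions", "titles": title,
                "rvlimit": 1, "rvdir": "older", "rvstart": SNAPSHOT,
                "format": "json"}).json()
            page = next(iter(revs["query"]["pages"].values()))
            if "revisions" in page:  # page may postdate SNAPSHOT
                yield title, page["revisions"][0]["revid"]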
2009/8/20 Anthony wikimail@inbox.org:
I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added.
That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year.
On Thu, Aug 20, 2009 at 6:57 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/8/20 Anthony wikimail@inbox.org:
I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added.
That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year.
No need for the article to be referenced at all, but yes, it would be time consuming, or at least person-time consuming. On the other hand, it'd answer the question, in a way that an automated process never could (assuming I've got my statistical analysis right, anyway: http://www.raosoft.com/samplesize.html seems to suggest a 99% confidence level for 664 random samples out of 3 million, but I'm not sure what "response distribution" means).
2009/8/21 Anthony wikimail@inbox.org:
On Thu, Aug 20, 2009 at 6:57 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/8/20 Anthony wikimail@inbox.org:
I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added.
That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year.
No need for the article to be referenced at all, but yes, it would be time consuming, or at least person-time consuming.
You mean you could go and find references for the information yourself? I suppose you could, but that is completely impractical.
On the other hand, it'd answer the question, in a way that an automated process never could (assuming I've got my statistical analysis right, anyway: http://www.raosoft.com/samplesize.html seems to suggest a 99% confidence level for 664 random samples out of 3 million, but I'm not sure what "response distribution" means).
The site looks like it is for surveys made up of yes/no questions; I don't think it is going to apply to this.
On Thu, Aug 20, 2009 at 7:20 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/8/21 Anthony wikimail@inbox.org:
On Thu, Aug 20, 2009 at 6:57 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/8/20 Anthony wikimail@inbox.org:
I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added.
That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year.
No need for the article to be referenced at all, but yes, it would be time consuming, or at least person-time consuming.
You mean you could go and find references for the information yourself? I suppose you could, but that is completely impractical.
My God. If a few dozen people couldn't easily determine to a relatively high degree of certainty what portion of a mere 0.03% of Wikipedia's articles are *vandalized*, how useless is Wikipedia?
On the other hand, it'd answer the question, in a way that an automated process never could (assuming I've got my statistical analysis right, anyway: http://www.raosoft.com/samplesize.html seems to suggest a 99% confidence level for 664 random samples out of 3 million, but I'm not sure what "response distribution" means).
The site looks like it is for surveys made up of yes/no questions; I don't think it is going to apply to this.
"Is this article vandalized?" is a yes/no question...
2009/8/21 Anthony wikimail@inbox.org:
My God. If a few dozen people couldn't easily determine to a relatively high degree of certainty what portion of a mere 0.03% of Wikipedia's articles are *vandalized*, how useless is Wikipedia?
I never said they couldn't. I said they couldn't do it by just looking at the most recent revision.
2009/8/21 Anthony wikimail@inbox.org:
"Is this article vandalized?" is a yes/no question...
True, but that isn't actually the question that this research tried to answer. It tried to answer "How much time has this article spent in a vandalised state?". If we are only interested in whether the most recent revision is vandalised then that is a simpler problem but would require a much larger sample to get the same quality of data.
On Thu, Aug 20, 2009 at 7:54 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/8/21 Anthony wikimail@inbox.org:
"Is this article vandalized?" is a yes/no question...
True, but that isn't actually the question that this research tried to answer. It tried to answer "How much time has this article spent in a vandalised state?".
"When one downloads a dump file, what percentage of the pages are actually in a vandalized state?"
"This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?"
That's the question I was referring to.
If we are only interested in whether the most recent revision is vandalised then that is a simpler problem but would require a much larger sample to get the same quality of data.
How much larger? Do you know anything about this, or are you just guessing? The number of random samples needed for a high degree of confidence tends to be much, much less than most people suspect. That much I know.
I found one problem with my use of http://www.raosoft.com/samplesize.html: I was specifying a margin of error of 5%. But that's an absolute margin of error. So if it were 0.2% vandalism, that'd be 0.2% plus or minus 5%. Obviously unacceptable.
However, the response distribution would then be 0.2%. This still would require 7649 samples for a 95% confidence plus or minus 0.1%. If the vandalism turned out to be more prevalent though, and I suspect it would, we could for instance be 95% confident plus or minus 0.5% if the response distribution was 0.5% and we had 765 samples.
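Those figures match the textbook sample-size formula for a proportion once a finite-population correction is applied. A quick check, assuming N = 3,000,000 articles; the earlier 664 corresponds to the calculator's default 50% response distribution, i.e. an assumed proportion p of 0.5:

    import math

    def sample_size(z, p, margin, population):
        """Sample size to estimate a proportion p to within +/- margin,
        with a finite-population correction."""
        n0 = z * z * p * (1 - p) / margin ** 2
        return math.ceil(n0 / (1 + (n0 - 1) / population))

    N = 3000000
    print(sample_size(2.576, 0.5, 0.05, N))    # 664  (99% conf., +/- 5%)
    print(sample_size(1.96, 0.002, 0.001, N))  # 7649 (95% conf., +/- 0.1%)
    print(sample_size(1.96, 0.005, 0.005, N))  # 765  (95% conf., +/- 0.5%)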
2009/8/21 Anthony wikimail@inbox.org:
If we are only interested in whether the most recent revision is vandalised then that is a simpler problem but would require a much larger sample to get the same quality of data.
How much larger? Do you know anything about this, or are you just guessing? The number of random samples needed for a high degree of confidence tends to be much, much less than most people suspect. That much I know.
I have a Master's degree in Mathematics, so I know a little about the subject. (I didn't study much statistics, but you can't do 4 years of Maths at Uni without getting some basic understanding of it.)
You say it requires 7649 articles, which sounds about right to me. If we looked through the entire history (or just the last year or 6 months or something if you want just recent data) then we could do it with significantly fewer articles. I'm not sure how many we would need, though.

I think we need to know what the distribution is for how long a randomly chosen article spends in a vandalised state before we can work out what the distribution of the average would be. My statistics isn't good enough to even work out what kind of distribution it is likely to be, and I certainly can't guess at the parameters. It obviously ranges between 0% and 100% with the mean somewhere close to 0% (0.4% seems like a good estimate) and will presumably have a long tail (truncated at 100%): there are articles that spend their entire life in a vandalised state (attack pages, for example) and there is a chance we'll completely miss such a page and it will last the entire length of the survey period, so the probability density at 100% won't be 0. I'm sure there is a distribution that satisfies those requirements, but I don't know what it is.
On Thu, Aug 20, 2009 at 3:57 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/8/20 Anthony wikimail@inbox.org:
I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added.
That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year.
It's not just facts. There are many ways to degrade the quality of an article (such as removing entire sections) that would be invisible if one looks at only one revision.
Anthony seems to be talking about a question of article accuracy (unless I am misreading him). That is an overlapping issue with addressing vandalism, but there are a significant number of ways to commit vandalism that nonetheless have nothing to do with impairing the resulting article's accuracy.
-Robert Rohde
On Thu, Aug 20, 2009 at 7:13 PM, Robert Rohde rarohde@gmail.com wrote:
On Thu, Aug 20, 2009 at 3:57 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/8/20 Anthony wikimail@inbox.org:
I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added.
That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year.
It's not just facts. There are many ways to degrade the quality of an article (such as removing entire sections) that would be invisible if one looks at only one revision.
I guess that's true. People could be removing facts, for instance, which wouldn't be apparent by looking at one revision. So such an analysis would potentially understate actual vandalism. But at least we'd know in which direction the percentage is potentially wrong. And anecdotally, I don't think the understatement would be significant.
There's also the question of whether or not we want to count an article which had a fact removed a few years ago and never re-added to be a "vandalized revision".
Anthony seems to be talking about a question of article accuracy
(unless I am misreading him).
I'm attempting to best answer the question "if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision", which I take to have nothing whatsoever to do with the number of reverts.
That is an overlapping issue with addressing vandalism, but there are a significant number of ways to commit vandalism that nonetheless have nothing to do with impairing the resulting article's accuracy.
Significant number? I can only think of a handful.
On Thu, Aug 20, 2009 at 4:37 PM, Anthony wikimail@inbox.org wrote:
On Thu, Aug 20, 2009 at 7:13 PM, Robert Rohde rarohde@gmail.com wrote:
On Thu, Aug 20, 2009 at 3:57 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/8/20 Anthony wikimail@inbox.org:
I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added.
That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year.
It's not just facts. There are many ways to degrade the quality of an article (such as removing entire sections) that would be invisible if one looks at only one revision.
I guess that's true. People could be removing facts, for instance, which wouldn't be apparent by looking at one revision. So such an analysis would potentially understate actual vandalism. But at least we'd know in which direction the percentage is potentially wrong. And anecdotally, I don't think the understatement would be significant.
You seem to be identifying all errors with vandalism. Sometimes factual errors are simply unintentional mistakes. I agree that accuracy is important, but I think you are thinking about the question somewhat differently than I am.
<snip>
I'm attempting to best answer the question "if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision", which I take to have nothing whatsoever to do with the number of reverts.
Let me describe the issue differently. The practical issue I am concerned with might be better expressed as the following: For any given article, what is the probability that the current revision is not the best available revision (i.e. most accurate, most complete, etc.)? Vandalism, in general, takes a page and makes it worse. I am interested in the problem of characterizing how often this happens with an eye to being able to go back to that prior better version. (This also explains why I am less interested in vandalism that persists through many revisions. Once that occurs, it makes less sense to try and go back to the pre-vandalized revision.)
Your concern for establishing overall article accuracy is a good one, but it is largely orthogonal to my interest in figuring out whether the current revision is likely to be better or worse than the revisions that came before it.
-Robert Rohde
On Thu, Aug 20, 2009 at 7:58 PM, Robert Rohde rarohde@gmail.com wrote:
You seem to be identifying all errors with vandalism.
How so?
Sometimes factual errors are simply unintentional mistakes.
Obviously we can't know the intent of the person for sure, but after a mistake is found it's relatively simple to find where it was added and decide whether or not we are going to call it vandalism. This is an inherent problem with answering the question. If you can't determine it manually, you sure as hell won't be able to determine it using automated methods.
Let me describe the issue differently. The practical issue I am concerned with might be better expressed as the following: For any given article, what is the probability that the current revision is not the best available revision (i.e. most accurate, most complete, etc.)? Vandalism, in general, takes a page and makes it worse. I am interested in the problem of characterizing how often this happens with an eye to being able to go back to that prior better version. (This also explains why I am less interested in vandalism that persists through many revisions. Once that occurs, it makes less sense to try and go back to the pre-vandalized revision.)
*nod*. Yes, we certainly have different things we're interested in measuring. If someone vandalizes an article, say to change the population of a country from 3 million to 2.9 million, and then 20 other people improve the article without fixing that fact, I'd still count that as vandalized.
On the other hand, are you sure you don't want to add an "indisputably" before "not the best available revision"? I mean, I'd say Wikipedia is probably in the double digit percentages, at least in terms of popular articles, if you don't add "indisputably".
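Finding where a discovered problem entered the history can indeed be mechanized: if the bad text persists once introduced, a binary search over the chronological revision list needs only about log2 of the history length in revision fetches. A sketch follows; the persistence assumption is the weak point, as the interleaved-vandalism example earlier in the thread shows.

    def find_introducing_revision(revision_ids, contains_bad_text):
        """Return the first revision id (chronological order) for which
        contains_bad_text(revid) is True, assuming the bad text is present
        in every revision from its introduction through the newest one."""
        lo, hi = 0, len(revision_ids) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if contains_bad_text(revision_ids[mid]):
                hi = mid  # introduced at mid or earlier
            else:
                lo = mid + 1  # introduced after mid
        return revision_ids[lo]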
On Thu, Aug 20, 2009 at 14:10, Anthony wikimail@inbox.org wrote:
On Thu, Aug 20, 2009 at 1:55 PM, Nathan nawrich@gmail.com wrote:
My point (which might still be incorrect, of course) was that an analysis based on 30,000 randomly selected pages was more informative about the English Wikipedia than 100 articles about serving United States Senators.
Any automated method of finding vandalism is doomed to failure. I'd say its informativeness was precisely zero.
Greg's analysis, on the other hand, was informative, but it was targeted at a much different question than Robert's.
"if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision" The best way to answer that question would be with a manually processed random sample taken from a pre-chosen moment in time. As few as 1000 revisions would probably be sufficient, if I know anything about statistics, but I'll let someone with more knowledge of statistics verify or refute that. The results will depend heavily on one's definition of "vandalism", though.
I did this in an informal fashion in 2005 during my "hundred article" surveys. Of the 503 pages I looked at, only one was clearly vandalized the first time I looked at it, so I'd say a thousand samples is probably too small to get any sort of precision on the vandalism rate.
On Thu, Aug 20, 2009 at 9:30 PM, Mark Wagner carnildo@gmail.com wrote:
On Thu, Aug 20, 2009 at 14:10, Anthony wikimail@inbox.org wrote:
"if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision" The best way to answer
that
question would be with a manually processed random sample taken from a pre-chosen moment in time. As few as 1000 revisions would probably be sufficient, if I know anything about statistics, but I'll let someone
with
more knowledge of statistics verify or refute that. The results will
depend
heavily on one's definition of "vandalism", though.
I did this in an informal fashion in 2005 during my "hundred article" surveys. Of the 503 pages I looked at, only one was clearly vandalized the first time I looked at it, so I'd say a thousand samples is probably too small to get any sort of precision on the vandalism rate.
Why? My understanding is that, if your methodology was correct, you can say with 96% confidence that the percentage of vandalized articles is less than 0.6%. That's useful. With 1000 samples, if you found two instances of vandalism, you'd have a 97% confidence that the percentage of vandalized articles is less than 0.5%.
I don't think it's that low, but if you publish the details of your "hundred article" surveys, I might be persuaded that it is.
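The bounds quoted here follow from the one-sided normal approximation for a proportion, sketched below; for events this rare the exact binomial bound comes out wider, so these numbers should be read as optimistic:

    import math

    def upper_bound(successes, n, z):
        """One-sided normal-approximation upper confidence bound for a
        proportion (shaky when successes are this rare)."""
        p = successes / n
        return p + z * math.sqrt(p * (1 - p) / n)

    print(upper_bound(1, 503, 1.751))   # ~0.0055, i.e. "96% conf., < 0.6%"
    print(upper_bound(2, 1000, 1.881))  # ~0.0047, i.e. "97% conf., < 0.5%"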
If we really do have that figure to that level of assurance, a more useful statistic would be the percentage of pageviews that result in a vandalized article. That could be arrived at by weighting by pageviews while choosing your random sample.
One flaw I found in my proposed methodology is that the "moment in time" needs to be randomized, since certain times of the day/week/year might very well experience higher vandalism than others.
Apologies to Nathan regarding the "Wikipedia Review" description. The analysis team was, indeed, recruited via Wikipedia Review; however, almost all of the participants in the research have now departed or reduced their participation in Wikipedia Review to such a degree that I don't personally consider it to have been a "Wikipedia Review" effort at all. I allowed my personal opinions to interfere with my recollection of the facts, though, and that's not kosher. Again, I hope you'll accept my apology.
I still maintain, however, that any study of the accuracy or the vandalized nature of Wikipedia content will be far more reliable and meaningful if human assessment is the underlying mechanism of analysis, rather than a "bot" or "script" that will simply tally things up. I think that Rohde's design was inherently flawed, and I'm happy that Greg Maxwell and I both immediately recognized the danger of running off and "reporting the good news", as Sue Gardner was apparently ready to do.
As I said, I feel that Rohde proceeded with research based on several highly questionable assumptions, while the "100 Senators" research rather carefully outlined a research plan that carried very few assumptions, other than that you trust the analysts to intelligently recognize vandalism or not. Nathan, by praising Rohde's work and disparaging my own, you seem to be suggesting that you would prefer to live inside a giant mountain composed of sticks and twigs, rather than in a small, pleasantly furnished 12' x 12' room. I just don't understand that line of thinking. I'd rather have a small bit of reliable data based on a stable premise than a giant pile of data based on an unstable premise.
Greg
Riddle me this...
Is the edit below vandalism?
http://en.wikipedia.org/w/index.php?title=Arch_Coal&diff=255482597&o...
Did the edit take a page and make it worse? Or, did it make the page a "better available revision" than the version immediately prior to it?
Methinks the Wikipedia community has a long way to go in learning to differentiate between a "better" encyclopedia and a "worse" encyclopedia before we take the step to try to define vandalism. Then, after we've done all that, there might be some remaining value in trying to quantify vandalism, as we've defined it.
Until then, for God's sake, Sue Gardner, do not gleefully run off publicizing that only 0.4% of Wikipedia's articles are vandalized.
Greg
Gregory Kohs wrote:
Riddle me this...
Is the edit below vandalism?
http://en.wikipedia.org/w/index.php?title=Arch_Coal&diff=255482597&o...
Did the edit take a page and make it worse? Or, did it make the page a "better available revision" than the version immediately prior to it?
It wasn't vandalism, and wasn't labelled as such; it was merely a change of wording, and perhaps emphasis. In the case of dispute, it should have gone to the Talk page, but it doesn't seem to have done so. Many editors undo and revert on the basis of felicity of language and emphasis, and unless it becomes an issue is an epiphenomenon of "the encyclopedia that anyone can edit". so I can't see how this is a good example of anything in particular.
Methinks the Wikipedia community has a long way to go in learning to differentiate between a "better" encyclopedia and a "worse" encyclopedia before we take the step to try to define vandalism. Then, after we've done all that, there might be some remaining value in trying to quantify vandalism, as we've defined it.
With multiplicitous interests being represented, all of them valid, and with very little general intersection, terms such as "better" and "worse" have little meaning, in my view, in that context. Nobody is qualified to make that assessment.
Until then, for God's sake, Sue Gardner, do not gleefully run off publicizing that only 0.4% of Wikipedia's articles are vandalized.
Unless it is said that "a recent informal study has shown that....". I don't think Robert claimed any rigorous validity for the work he did, but at least he's done it, and opened a debate.
Phil Nash wrote:
"Many editors undo and revert on the basis of felicity of language and emphasis, and unless it becomes an issue is an epiphenomenon of "the encyclopedia that anyone can edit". so I can't see how this is a good example of anything in particular."
And, with point proven, I rest my case.
Greg
And here is where many of the flaws of the University of Minnesota study were exposed:
http://chance.dartmouth.edu/chancewiki/index.php/Chance_News_31#The_Unbreaka...
Their methodology of tracking the persistence of words was questionable, to say the least.
And here was my favorite part:
*"We exclude anonymous editors from some analyses, because IPs are not stable: multiple edits by the same human might be recorded under different IPs, and multiple humans can share an IP.*"
So, in a study evaluating the "damaged views" within 34 trillion edits, they excluded the 9 trillion edits by IP addresses? If you're not laughing right now, then you must be new to Wikipedia.
Greg
On Thu, Aug 20, 2009 at 11:02 PM, Gregory Kohs thekohser@gmail.com wrote:
And here was my favorite part:
*"We exclude anonymous editors from some analyses, because IPs are not stable: multiple edits by the same human might be recorded under different IPs, and multiple humans can share an IP.*"
I have to say that this one was "better": "We believe it is reasonable to assume that essentially all damage is repaired within 15 revisions." Talk about begging the question.