On Feb 13, 2008 11:43 AM, Cool Hand Luke failure.to.communicate@gmail.com wrote:
Having said that! Focusing on the actual evidence here - I'm not a mathematician or an expert in statistical analysis, and few of the folks working on this analytical project are either. However, it is clear to me that the sample suffers from a number of mathematical problems - mostly relating to its size and selection, and the significance attached to the results. If you want to make a comprehensive declaration based on this type of analysis, you need a much more robust set of data to work with and a much more serious approach to mining it for significant data points.
Nathan
As far as I know, no one has ever thought to do what I've done here. The SevenOfDiamonds case was happy just to list similar traits and plot a similar graph of editing patterns between the users. That wasn't enough for me: I wanted to know how meaningful similar editing patterns and traits are. So I ran surveys with data as random as I could get, and no one else is even remotely similar. You would have that count against the data. GWH protested that I didn't include enough similarly situated (i.e., presumably New Yorker) editors. So I grabbed some of them too, and still nothing is as close as these two. It seems that some editors would not be satisfied until every last Wikipedian is compared.
That simply isn't necessary. Their editing patterns match well, and are rare (probably less than 1-in-21 based on what I've looked at so far, and that's while I'm trying to find users who will match). Many of their editing traits are shared by almost no other accounts (1-in-10 or less for at least a half dozen or so traits). If these variables are even somewhat independent, they multiply together into very long odds.
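As a rough illustration of the multiplication argument (a sketch only: it takes the 1-in-21 and 1-in-10 figures quoted above as given, assumes exactly six traits, and assumes full independence, which is the very thing in dispute), the arithmetic looks like this in Python. The function name is made up for the example.

```python
from fractions import Fraction
from math import prod

def joint_rate_if_independent(rates):
    """Multiply individual match rates; valid only if the traits are independent."""
    return prod(rates, start=Fraction(1))

# Figures quoted in the message above; "six traits" is an assumption
# standing in for "at least a half dozen or so".
pattern_rate = Fraction(1, 21)
trait_rates = [Fraction(1, 10)] * 6
all_rates = [pattern_rate, *trait_rates]

# Best case for the argument: every trait is independent of the others.
combined = joint_rate_if_independent(all_rates)

# Worst case: the traits are so correlated that they add no information,
# so all we can say is that the joint rate is no higher than the rarest one.
upper_bound = min(all_rates)

print(f"independent: 1 in {combined.denominator:,}")
print(f"no independence assumed: at most 1 in {upper_bound.denominator}")
```

The real figure sits somewhere between those two extremes, depending on how correlated the traits turn out to be.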
And the interleaving is really compelling based on the editing style. GWH protested that my sample users don't resemble their editing patterns well enough (as indeed no accounts do), but by comparing an editor who makes a significant proportion of their edits while Mantanmoreland doesn't edit, I would have expected to find FEWER intersections, not more. These users never edited at the same time as each other, and this is at least eyebrow-raising.
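For anyone who wants to check the interleaving on other pairs of accounts, here is a minimal sketch of one way to count "intersections". It is not the method used for the evidence page: the 30-minute window, the function name, and the sample timestamps are all assumptions for illustration.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def count_close_edits(edits_a, edits_b, window=timedelta(minutes=30)):
    """Count edits in edits_a that have at least one edit in edits_b within `window`."""
    sorted_b = sorted(edits_b)
    close = 0
    for t in edits_a:
        # First edit by B that is not earlier than (t - window).
        i = bisect_left(sorted_b, t - window)
        if i < len(sorted_b) and sorted_b[i] <= t + window:
            close += 1
    return close

# Hypothetical timestamps, not taken from any real account.
account_a = [datetime(2008, 1, 5, 10, 0), datetime(2008, 1, 5, 14, 0)]
account_b = [datetime(2008, 1, 5, 10, 20), datetime(2008, 1, 5, 18, 0)]

intersections = count_close_edits(account_a, account_b)
print(intersections)
```

Running the same count for one account against many comparison accounts gives a distribution to set a zero (or near-zero) result against.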
All of these things together are damning--and that's without even keeping our eye on the ball: that these users shared a POV, that Samiharris started editing Weiss within one day of Mantanmoreland quitting, and that Mantanmoreland knew his edits would be monitored, so had a motive to spin a sock for possible COI abuse.
Quack.
Cool Hand Luke
Some points -
1. A lot of the statistical stuff still needs a lot of work. The section 13 analysis really does need re-running with editors who are better whole-day matches to the edit pattern, for example. You and others brushed this off - that's not a reasonable response. At this point section 13 is faulty.
2. *All* of the statistical stuff, particularly the user phrase analysis, should be flipped around and run the other way to generate anything like a solid profile. If user X uses phrases A, B, and C a lot, and user Y uses phrases A, B, and C a lot, then they're similar in that sense. But to understand the level of confidence in their similarity, one then should flip it around and analyze the database contents, and find users who use phrases A, B, and C a lot, and then see if the number and distribution of users implies that the similarity in that aspect is truly uncommon or not uncommon.
It's been baldly asserted by a number of people that "of course" similarities imply linkage - the reverse analysis shows how likely the linkage is to be causal as opposed to merely statistical. It may well be that 5% of total users fit the profile of "uses A, B, and C", and that we would statistically *expect* that out of (these numbers are representative but made up) about 1 million US users, we would find about say 3,000 in New York City, of whom about 150 would also use phrases A, B, and C, and about say 25 of whom would share a similar editing time profile. Given a whole US population, the odds would then be high that somewhere, in some metro area, there are two factually unrelated people who use a given set of similar phrase patterns, the same edit time of day profile, and some overlapping article interests.
Those are made up numbers - however, they might well be pretty close to reality (or not - I haven't run them, and statistical guesses are a really bad analytical tool 8-).
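To make the funnel in point 2 concrete, here is a small sketch of the arithmetic using the same made-up numbers. The function and variable names are invented for the example, and it treats each filter as an independent fraction of the group before it, which is itself an assumption.

```python
from fractions import Fraction

def funnel(population, fractions):
    """Return the expected head-count after each successive filter."""
    counts = [Fraction(population)]
    for f in fractions:
        counts.append(counts[-1] * f)
    return counts

# Made-up figures from the message above, not measurements.
US_USERS = 1_000_000
IN_NYC = Fraction(3_000, 1_000_000)      # about 3,000 of 1 million
USES_ABC = Fraction(5, 100)              # 5% use phrases A, B, and C
SAME_TIME_PROFILE = Fraction(25, 150)    # about 25 of those 150

counts = funnel(US_USERS, [IN_NYC, USES_ABC, SAME_TIME_PROFILE])
expected_matches = counts[-1]

# Number of distinct pairs among the people who pass every filter:
# each pair is two unrelated users who would look alike on these measures.
unrelated_pairs = expected_matches * (expected_matches - 1) / 2

print([int(c) for c in counts])
print(int(unrelated_pairs))
```

Swapping in measured fractions from the database for the made-up ones is the "flip it around" step; the structure of the calculation stays the same.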
3. This whole incident has several aspects which are very unlike previous sockpuppet analysis I am aware of. In a sense, this is good - we're starting to develop some tools to determine analytically stuff that previously has been gut feelings. It's also showing that we have some disagreement as to the validity of the statistical methods, and experience with statistics and analysis.
In my opinion, having done statistics in college, in workplaces, and in scientific analysis from time to time, the investigative methods used so far are applicable but have not been applied with sufficient statistical depth and rigor... Yet. I don't see any reason why people can't perform the rest of the research to answer these other questions, and I think that we can eventually reach the point that my doubts are answered from a truly statistical likelihood sense of "proving the case".
I believe that IF you intend to use statistical methods to try and prove the case, the burden of proof should be on those "prosecuting" the case to convince skeptics that the statistical analysis has gotten good enough. I am not yet convinced, from a statistical sense. There are too many unknowns, and the rigor of the work isn't there yet. We don't yet have agreement on what level of rigor and testing is required. But I think we can find that agreement.
4. All of that said, through the haze, there's a sense that a duck has been sighted.
How we handle a combination of duck test feeling and statistical findings to reach a final conclusion is up to arbcom and community consensus.