On Feb 13, 2008 11:43 AM, Cool Hand Luke wrote:
Focusing on the actual evidence here: I'm not a mathematician or an
expert in statistical analysis, and few of the folks working on this
analytical project are either. However, it is clear to me that the
sample suffers from a number of mathematical problems - mostly relating
to its size and selection, and to the significance attached to the
results. If you want to make a comprehensive declaration based on this
type of analysis, you need a much more robust set of data to work with
and a much more serious approach to mining it for significant data points.
As far as I know, no one has ever thought to do what I've done here. The
SevenOfDiamonds case was happy just to list similar traits and plot a
similar graph of editing patterns between the users. That wasn't enough for
me: I wanted to know how meaningful similar editing patterns and traits
actually are. So I ran surveys on samples as random as I could make them,
and no one else is even remotely similar. Yet you would have that count
against the data. GWH protested that I didn't include enough similarly
situated (i.e., presumably New Yorker) editors. So I grabbed some of them
too, and still nothing is as close as these two. It seems that some editors
would not be satisfied until every last Wikipedian has been compared.
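(A minimal sketch of the kind of control survey being described, purely to
show the shape of it - the editor pool, the sample size, and the similarity
score below are placeholders I've made up, not the data or method actually
used:

    import random

    # Hypothetical pool of editors to draw controls from; a real survey
    # would pull these from actual contribution histories.
    all_editors = [f"Editor{i}" for i in range(10_000)]

    def similarity(editor_a, editor_b):
        # Stand-in for a real edit-pattern/trait comparison, scored 0..1.
        return random.random()

    controls = random.sample(all_editors, 50)   # an as-random-as-possible sample
    control_scores = {e: similarity("Mantanmoreland", e) for e in controls}
    best_control = max(control_scores, key=control_scores.get)
    pair_score = similarity("Mantanmoreland", "Samiharris")

    print(f"Best control match: {best_control} ({control_scores[best_control]:.2f})")
    print(f"Suspected pair score: {pair_score:.2f}")

The point of such a survey is to see whether any randomly chosen control
comes anywhere near the score of the suspected pair.)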
That simply isn't necessary. Their editing patterns match well and are
rare (probably less than 1-in-21 based on what I've looked at so far, and
that's while I'm actively trying to find users who will match). Many of their
editing traits are shared by almost no other accounts (1-in-10 or less for at
least half a dozen traits). If these variables are even somewhat
independent, they multiply together into very long odds.
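(To make "very long odds" concrete, here is a minimal sketch of the
multiplication being described. The 1-in-21 and 1-in-10 figures are the ones
quoted above; the particular list of traits and the independence assumption
are illustrative, not measured:

    # Individually modest trait frequencies compound quickly if the
    # traits are (roughly) independent. The figures echo the 1-in-21 and
    # 1-in-10 numbers above; the trait labels are placeholders.
    trait_frequencies = {
        "overall edit-pattern match": 1 / 21,
        "trait A": 1 / 10,
        "trait B": 1 / 10,
        "trait C": 1 / 10,
    }

    combined = 1.0
    for freq in trait_frequencies.values():
        combined *= freq   # only valid if the traits are roughly independent

    print(f"Combined frequency: {combined:.2e} (about 1 in {round(1 / combined):,})")

Whether the traits really are independent is exactly the kind of question the
rest of this thread argues about.)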
And the interleaving of edits is really compelling given their editing
styles. GWH protested that my sample users don't resemble their editing
patterns well enough (as indeed no accounts do), but when comparing against
an editor who makes a significant proportion of their edits while
Mantanmoreland doesn't edit, I would have expected to find FEWER
intersections, not more. These users never edited at the same time as each
other, and that is at least eyebrow-raising.
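(The interleaving claim is straightforward to check mechanically. A minimal
sketch, assuming each account's edit timestamps have been pulled into a
list - the sample timestamps and the hour-level bucketing are illustrative
choices, not how the original comparison was run:

    from datetime import datetime

    # Hypothetical edit timestamps; real ones would come from each
    # account's contribution history.
    mantanmoreland_edits = [datetime(2007, 6, 1, 14, 5), datetime(2007, 6, 1, 14, 40)]
    samiharris_edits = [datetime(2007, 6, 1, 15, 10), datetime(2007, 6, 2, 9, 30)]

    def active_hours(edits):
        # Bucket each edit into its calendar hour (an arbitrary granularity).
        return {e.replace(minute=0, second=0, microsecond=0) for e in edits}

    overlap = active_hours(mantanmoreland_edits) & active_hours(samiharris_edits)
    print(f"Hours in which both accounts edited: {len(overlap)}")
    # Zero overlap across thousands of edits is the "never edited at the
    # same time" observation; how surprising that is depends on how much
    # each account edits overall.

)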
All of these things together are damning, and that's without even keeping
our eye on the ball: these users shared a POV, Samiharris started editing
Weiss within one day of Mantanmoreland quitting, and Mantanmoreland knew his
edits would be monitored, so he had a motive to spin up a sock for possible
COI abuse.
Cool Hand Luke
Some points -
1. A lot of the statistical stuff still needs a lot of work. The
section 13 analysis really does need re-running with editors whose
whole-day edit patterns are better matches, for example. You and
others brushed this off - that's not a reasonable response. At this
point section 13 is faulty.
2. *All* of the statistical stuff, particularly the user phrase
analysis, should be flipped around and run the other way to generate
anything like a solid profile. If user X uses phrases A, B, and C a
lot, and user Y uses phrases A, B, and C a lot, then they're similar
in that sense. But to understand the level of confidence in that
similarity, one should then flip it around, analyze the database
contents, find all the users who use phrases A, B, and C a lot, and
see whether the number and distribution of such users implies that the
similarity is genuinely uncommon or actually fairly common (see the
sketch at the end of this point).
It's been baldly asserted by a number of people that "of course"
similarities imply linkage - the reverse analysis shows how likely the
linkage is to be causal as opposed to merely statistical. It may well
be that 5% of total users fit the profile of "uses A, B, and C", and
that we would statistically *expect* that out of (these numbers are
representative but made up) about 1 million US users, we would find
say 3,000 in New York City, of whom about 150 would also use
phrases A, B, and C, and of whom say 25 would share a similar
editing time profile. Given the whole US population, the odds would
then be high that somewhere, in some metro area, there are two factually
unrelated people who use a given set of similar phrase patterns, the
same time-of-day edit profile, and some overlapping article interests.
Those are made-up numbers - however, they might well be pretty close
to reality (or not - I haven't run them, and statistical guesses are a
really bad analytical tool 8-).
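(To be concrete about what flipping the analysis around might look like,
here is a minimal sketch. The phrase-count table, threshold, and user names
are hypothetical, and the funnel at the end simply re-uses the made-up
numbers above; it shows the shape of the calculation, not a result:

    # Reverse lookup: rather than comparing X and Y directly, scan the
    # whole corpus for accounts that use all of the marker phrases a lot.
    phrase_counts = {
        # hypothetical per-user counts of phrases A, B, and C
        "UserX": {"A": 12, "B": 7, "C": 9},
        "UserY": {"A": 10, "B": 8, "C": 11},
        "UserZ": {"A": 1, "B": 0, "C": 2},
    }
    THRESHOLD = 5   # "uses the phrase a lot" - an arbitrary cutoff
    matches = [u for u, counts in phrase_counts.items()
               if all(counts.get(p, 0) >= THRESHOLD for p in ("A", "B", "C"))]
    print("Accounts fitting the phrase profile:", matches)

    # Expected-count funnel, using the made-up numbers from above:
    nyc_users = 3_000              # of ~1 million US users, say 3,000 in NYC
    phrase_rate = 0.05             # 5% fit the "uses A, B, and C" profile
    time_profile_rate = 25 / 150   # share with a similar editing-time profile
    expected = nyc_users * phrase_rate * time_profile_rate
    print(f"Expected unrelated NYC editors fitting the whole profile: {expected:.0f}")

The interesting output is the second number: if the expected count of
unrelated look-alikes is well above one, the similarity proves little.)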
3. This whole incident has several aspects which are very unlike any
previous sockpuppet analysis I am aware of. In a sense, this is good
- we're starting to develop tools to determine analytically what has
previously been a matter of gut feeling. It's also showing that we
disagree about the validity of the statistical methods, and that our
experience with statistics and analysis varies.
In my opinion - having done statistics in college, in workplaces, and
in scientific analysis from time to time - the investigative methods
used so far are applicable but have not been applied with sufficient
statistical depth and rigor... Yet. I don't see any reason why people
can't perform the rest of the research to answer these other
questions, and I think we can eventually reach the point where my
doubts are answered in a truly statistical, likelihood-based sense of
"proving the case".
I believe that IF you intend to use statistical methods to try to
prove the case, the burden of proof should be on those "prosecuting"
the case to convince skeptics that the statistical analysis has gotten
good enough. I am not yet convinced, in a statistical sense. There
are too many unknowns, and the rigor of the work isn't there yet. We
don't yet have agreement on what level of rigor and testing is
required, but I think we can find that agreement.
4. All of that said, through the haze, there's a sense that a duck has
quacked. How we handle a combination of duck-test feeling and statistical
findings to reach a final conclusion is up to arbcom and the community.
-george william herbert