[WikiEN-l] Samiharris looks like a sock, and everyone suspected

Thu Feb 14 01:34:48 UTC 2008

On Feb 13, 2008 11:43 AM, Cool Hand Luke
<failure.to.communicate at gmail.com> wrote:
> > Having said that! Focusing on the actual evidence here - I'm not a
> > mathematician or an expert in statistical analysis, and few of the
> > folks working on this analytical project are either. However, it is
> > clear to me that the sample suffers from a number of mathematical
> > problems - mostly relating to its size and selection, and the
> > significance attached to the results. If you want to make a
> > comprehensive declaration based on this type of analysis, you need a
> > much more robust set of data to work with and a much more serious
> > approach to mining it for significant data points.
> >
> > Nathan
> >
> >
> >
> As far as I know, no one has ever thought to do what I've done here. The
> SevenOfDiamonds case was happy just to list similar traits and plot a
> similar graph of editing patterns between the users.  That wasn't enough for
> me: I wanted to know how meaningful similar editing patterns and traits
> are.  So I did surveys with data as random as I could, and no one else is
> even remotely similar.  You would have that count against the data.  GWH
> protested that I didn't include enough similarly situated (ie, presumably
> New Yorker) editors.  So I grabbed some of them too, and still nothing is as
> close as these two.  It seems that some editors would not be satisfied until
> every last Wikipedian is compared.
>
> That simply isn't necessary.  Their editing patterns match well, and are
> rare (probably less than 1-in-21 based on what I've looked at so far, and
> that's while I'm trying to find users who will match). Many of their editing
> traits are shared by almost no other accounts (1-in-10 or less for at least
> a half dozen or so traits). If these variables are even somewhat
> independent, they multiply together into very long odds.
>
> And the interleaving is really compelling based on the editing style. GWH
> protested that my sample users don't resemble their editing patterns well
> enough (as indeed no accounts do), but by comparing an editor who makes a
> significant proportion of their edits while Mantanmoreland doesn't edit, I
> would have expected to find FEWER intersections, not more. These users never
> edited at the same time as each other, and this is at least eyebrow-raising.
>
> All of these things together are damning--and that's without even keeping
> our eye on the ball--that these users shared POV, that Samiharris started
> editing Weiss within one day of Mantanmoreland quitting, and that Mantanmoreland
> knew his edits would be monitored, so had a motive to spin a sock for
> possible COI abuse.
>
> Quack.
>
> Cool Hand Luke

Some points -

1.  A lot of the statistical stuff still needs a lot of work.  The
section 13 analysis really does need re-running with editors whose
whole-day edit patterns are better matches, for example.  You and
others brushed this off - that's not a reasonable response.  At this
point section 13 is faulty.

2. *All* of the statistical stuff, particularly the user phrase
analysis, should be flipped around and run the other way to generate
anything like a solid profile.  If user X uses phrases A, B, and C a
lot, and user Y uses phrases A, B, and C a lot, then they're similar
in that sense.  But to understand the level of confidence in their
similarity, one then should flip it around and analyze the database
contents, and find users who use phrases A, B, and C a lot, and then
see if the number and distribution of users implies that the
similarity in that aspect is truly uncommon or not uncommon.
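
As a rough sketch of that reverse analysis (not anything that has
actually been run): assume a hypothetical table mapping each user to
the rate at which they use each phrase.  The user names, rates,
threshold, and the function names users_matching_profile and base_rate
are all made up for illustration.

```python
from typing import Dict, List


def users_matching_profile(
    phrase_rates: Dict[str, Dict[str, float]],
    phrases: List[str],
    min_rate: float,
) -> List[str]:
    """Return the users whose rate for every listed phrase is >= min_rate."""
    return sorted(
        user
        for user, rates in phrase_rates.items()
        if all(rates.get(phrase, 0.0) >= min_rate for phrase in phrases)
    )


def base_rate(
    phrase_rates: Dict[str, Dict[str, float]],
    phrases: List[str],
    min_rate: float,
) -> float:
    """Fraction of all users in the table who fit the phrase profile."""
    if not phrase_rates:
        return 0.0
    matched = users_matching_profile(phrase_rates, phrases, min_rate)
    return len(matched) / len(phrase_rates)


# Toy, made-up data: phrase uses per edit summary for five accounts.
toy_rates = {
    "user_x": {"A": 0.04, "B": 0.03, "C": 0.05},
    "user_y": {"A": 0.05, "B": 0.02, "C": 0.04},
    "user_p": {"A": 0.03, "B": 0.00, "C": 0.01},
    "user_q": {"A": 0.00, "B": 0.00, "C": 0.06},
    "user_r": {"B": 0.02},
}

matches = users_matching_profile(toy_rates, ["A", "B", "C"], 0.02)
rate = base_rate(toy_rates, ["A", "B", "C"], 0.02)
print(matches, rate)
```

The point of the second function is the denominator: the similarity
between X and Y only means something once you know how many other
users in the whole table also fit the profile.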

It's been baldly asserted by a number of people that "of course"
similarities imply linkage - the reverse analysis shows how likely the
linkage is to be causal as opposed to merely a statistical coincidence.
It may well be that 5% of total users fit the profile of "uses A, B,
and C", and that
we would statistically *expect* that out of (these numbers
representative but made up) about 1 million US users, we would find
about say 3,000 in New York City, of whom about 150 would also use
phrases A, B, and C, and about say 25 of whom would share a similar
editing time profile.  Given a whole US population, the odds would
then be high that somewhere, in some metro area, are two factually
unrelated people who use a given set of similar phrase patterns, the
same edit time of day profile, and some overlapping article interests.

Those are made up numbers - however, they might well be pretty close
to reality (or not - I haven't run them, and statistical guesses are a
really bad analytical tool 8-).
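
For what it's worth, the arithmetic behind those made-up numbers can
be written out as below.  Every constant is one of the illustrative
figures from this message, not a measurement, and the pair count is
just my own way of showing how many coincidental two-person matches a
pool of 25 allows.

```python
from math import comb

# Illustrative figures from the message above; none of these are measured.
US_USERS = 1_000_000        # assumed number of US users
NYC_USERS = 3_000           # assumed number of those in New York City
PHRASE_FRACTION = 0.05      # assumed share who use phrases A, B, and C
NYC_PHRASE_AND_TIME = 25    # assumed NYC phrase users with a similar time profile

# 5% of the NYC users fit the phrase profile.
nyc_phrase_users = round(NYC_USERS * PHRASE_FRACTION)

# Implied share of those who also share the editing time profile.
time_fraction = NYC_PHRASE_AND_TIME / nyc_phrase_users

# Distinct pairs of unrelated people inside that final pool.
candidate_pairs = comb(NYC_PHRASE_AND_TIME, 2)

# Phrase-profile users across the whole assumed US population.
us_phrase_users = round(US_USERS * PHRASE_FRACTION)

print(nyc_phrase_users, round(time_fraction, 3), candidate_pairs, us_phrase_users)
```

With these assumed inputs there is a pool of 25 people in one metro
area, any two of whom would look alike on phrases and edit times
without being related.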

3. This whole incident has several aspects which are very unlike
previous sockpuppet analysis I am aware of.  In a sense, this is good
- we're starting to develop some tools to determine analytically stuff
that previously has been gut feelings.  It's also showing that we have
some disagreement as to the validity of the statistical methods, and
experience with statistics and analysis.

In my opinion, having done statistics in college, in workplaces, and
in scientific analysis from time to time, the investigative methods
used so far are applicable but have not been applied with sufficient
statistical depth and rigor... Yet.  I don't see any reason why people
can't perform the rest of the research to answer these other
questions, and I think that we can eventually reach the point that my
doubts are answered from a truly statistical likelihood sense of
"proving the case".

I believe that IF you intend to use statistical methods to try and
prove the case, the burden of proof should be on those "prosecuting"
the case to convince skeptics that the statistical analysis has gotten
good enough.  I am not yet convinced, in a statistical sense.  There
are too many unknowns, and the rigor of the work isn't there yet.  We
don't yet have agreement on what level of rigor and testing is
required.  But I think we can find that agreement.

4. All of that said, through the haze, there's a sense that a duck has
been sighted.

How we handle a combination of duck test feeling and statistical
findings to reach a final conclusion is up to arbcom and community
consensus.

-- 
-george william herbert
george.herbert at gmail.com