[WikiEN-l] "How the Professor Who Fooled Wikipedia Got Caught by Reddit", _The Atlantic_

Mon May 21 22:33:05 UTC 2012

On Mon, May 21, 2012 at 6:02 PM, Gwern Branwen <gwern0 at gmail.com> wrote:
> On Mon, May 21, 2012 at 5:32 PM, Anthony <wikimail at inbox.org> wrote:
>> How could we do that?  You could have just cherrypicked the worst
>> links that were last links which are not official or
>> template-generated in External Link sections.  I'm not saying I think
>> you did that.  But you certainly could have.
>
> Cherrypicking even under this strategy would force me to do both >2x
> as much work and engage in conscious deception.

Yes.  I'm not saying I think you did that.  It never crossed my mind
that you might have intentionally tried to bias the sample, until you
said "anyone will be able to check whether I did".  We can't check.
We simply have to trust you that you picked the links in the way that
you claim to have picked the links.

In any case, it really doesn't matter, because your sample *was*
biased, regardless of your intention.

>> Anyway, the main thing I'd like to say about all of this is simply
>> that your selection is not random.  Your sample is biased.  Biased in
>> which direction, I don't know.  Biased intentionally, I doubt.  But
>> your sample is biased.
>
> Sheesh. Every sample is biased in many ways - but random samples are
> biased in unpredictable ways, which is why randomizing was such a big
> innovation when Fisher and his contemporaries introduced it. What's
> next, PRNGs are unacceptable for any kind of study because you can
> predict each output if you know the seed and run the PRNG
> appropriately?

You should read more about sampling bias.  Or talk to someone who has.

PRNGs are acceptable, though you do have to be careful to avoid
publication bias.

If you took a list of all external links, and then used a PRNG to pick
100 numbers between 1 and N (the number of links), and then removed
those external links, then you would have a random sample.  The fact
that you can predict each output if you know the seed and run the PRNG
appropriately would only come into play if you ran the test several
times, with different seeds, and selected one of the runs.
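
As a rough sketch of what I mean (not anything that was actually run):
the list `links`, the sample size default, and the seed value below are
all made up for illustration, and I am assuming the 100 picks are
distinct, i.e. drawn without replacement.  The point is only that the
seed is fixed once, up front, and every link in the flat list has the
same chance of being chosen.

```python
import random

def sample_links(all_links, k=100, seed=2012):
    """Draw k links uniformly at random, without replacement, from a flat list."""
    # The seed is chosen once, before any results are seen, so the run
    # is reproducible and cannot be re-rolled until it looks "good".
    rng = random.Random(seed)
    return rng.sample(all_links, k)

# Hypothetical stand-in for a dump of every external link on the site:
# (article title, URL) pairs.
links = [(f"Article_{i % 50}", f"http://example.org/{i}") for i in range(1000)]
chosen = sample_links(links)
```

Anyone with the same list and the same seed gets the same 100 links
back, which is what would make the selection checkable.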

By picking articles first, then picking links, you introduce bias.
You are biasing your links toward those which are in articles with
fewer links.  These are probably less likely to be noticed when
removed, because articles with lots of links are more likely to be on
watchlists, and tend to have more objective criteria.  By limiting
yourself to links in the External Links section, you introduce bias.
These links tend to be the least useful, as they are essentially
miscellanea.  By limiting yourself to links which are not official,
you introduce bias.  This one is pretty obvious, I think, and it is
one introduction of bias which I think you did intentionally.  The
removal of official links is quite clearly more likely to be reverted.
By limiting yourself to links in articles with more than one external
link, and only to links which are not template-generated, you
introduce bias.  You pretty much admit this, and admit that the bias
was intentional ("avoids issues where pages might have 5 or 10
'official' external links to various versions or localizations, all of
which an editor could confidently and blindly revert the removal of;
template-generated links also carry imprimaturs of authority").
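
To put numbers on the first of these (articles first, then links),
here is a small sketch.  It assumes a simplified two-stage scheme:
pick an article uniformly, then pick one of its links uniformly.  The
two articles and their link counts are invented, and the function name
is mine.  It compares the chance that a single draw lands on a given
link under that scheme with the chance under a uniform draw from the
flat list of all links.

```python
from fractions import Fraction

def per_draw_probabilities(link_counts):
    """Map each article to (two_stage, uniform): the chance that one draw
    hits one particular link in that article under each scheme."""
    n_articles = len(link_counts)
    n_links = sum(link_counts.values())
    probs = {}
    for article, count in link_counts.items():
        # Two-stage: choose the article uniformly, then one of its links.
        two_stage = Fraction(1, n_articles) * Fraction(1, count)
        # Uniform: choose directly from the flat list of all links.
        uniform = Fraction(1, n_links)
        probs[article] = (two_stage, uniform)
    return probs

# Invented example: one article with 2 links, one with 18.
counts = {"short_article": 2, "long_article": 18}
for article, (two_stage, uniform) in per_draw_probabilities(counts).items():
    print(article, two_stage, uniform)
# short_article 1/4 1/20
# long_article 1/36 1/20
```

In this toy case a link in the short article is nine times as likely
to be drawn as a link in the long one, where a uniform draw would
treat them the same.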

All of this is fine, by the way, depending on what you intended to
show.  If it was to show that a certain type of external link can
be removed without likely being reverted, then your methodology is
fine.  But then you shouldn't advertise your experiment as "the
removal of 100 random external links", because that is not what you
did.