On 10/11/2014 17:23, Chris Steipp wrote:
On the general topic, I think either a captcha or verifying an email creates only a small barrier to building a bot, but it's significant enough that it keeps the amateur bots out. I'd be very interested in seeing an experiment run to measure the exact impact, though.
Google had a great blog post on this subject where they made reCAPTCHA easier to solve, and instead:

"The updated system uses advanced risk analysis techniques, actively considering the user's entire engagement with the CAPTCHA--before, during and after they interact with it. That means that today the distorted letters serve less as a test of humanity and more as a medium of engagement to elicit a broad range of cues that characterize humans and bots."
So I think spending time on a new engine that allows for environmental feedback from the system solving the captcha, and that lets us tune lots of things besides whether the "user" sent back the right string of letters, would be well worth our time.
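
As a rough illustration, such an engine might fold several environmental cues into a single tunable risk score. The Python sketch below is purely hypothetical; the signal names, weights, and thresholds are my own assumptions, not anything from an existing extension:

    # Hypothetical sketch: combine cues gathered before, during, and after
    # the captcha is shown into one risk score; correctness of the typed
    # string is just one signal among several. All weights are assumptions.
    from dataclasses import dataclass

    @dataclass
    class CaptchaSession:
        seconds_to_solve: float      # time between render and submit
        pointer_events: int          # mouse/touch activity while visible
        prior_requests_from_ip: int  # recent request volume from this IP
        answer_correct: bool

    def risk_score(s: CaptchaSession) -> float:
        """Return a score in [0, 1]; higher means more bot-like."""
        score = 0.0
        if s.seconds_to_solve < 2.0:      # humans rarely solve this fast
            score += 0.4
        if s.pointer_events == 0:         # no pointer movement at all
            score += 0.3
        if s.prior_requests_from_ip > 50:
            score += 0.2
        if not s.answer_correct:
            score += 0.1
        return min(score, 1.0)

    def decide(s: CaptchaSession) -> str:
        # These cut-offs are exactly the kind of thing we could tune.
        r = risk_score(s)
        if r < 0.3:
            return "pass"
        if r < 0.7:
            return "re-challenge"
        return "block"

The point of the sketch is that the decision becomes a tunable threshold over many signals rather than a binary right/wrong answer.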
On 04/12/2014 05:35, Robert Rohde wrote:
We have many smart people, and undoubtedly we could design a better captcha. However, no matter how smart the mousetrap, as long as you leave it strewn around the doors and hallways, well-meaning people are going to trip over it.
I would support removing the captcha from generic entry points, like the account registration page, where we know many harmless people are affected.
However, captchas might be useful if used in conjunction with simple
behavioral analysis, such as rate limiters. For example, if an IP is
creating a lot of accounts or editing at a high rate of speed, those are
bad signs. Adding the same external link to multiple pages is often a very
bad sign. However, adding a link to the NYTimes or CNN or an academic
journal is probably fine. With that in mind, I would also eliminate the
external link captcha in most cases where a link has only been added once
and try to be more intelligent about which sites trigger it otherwise.
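
A minimal sketch of gating the captcha behind such heuristics might look like the Python below; the thresholds and the trusted-domain list are illustrative assumptions, not existing MediaWiki configuration:

    # Hypothetical sketch: only show a captcha when behavioural heuristics
    # look suspicious. The limits and whitelist here are assumptions.
    from typing import Optional

    TRUSTED_DOMAINS = {"nytimes.com", "cnn.com"}   # e.g. major news outlets

    MAX_ACCOUNTS_PER_IP_PER_DAY = 3
    MAX_EDITS_PER_MINUTE = 10

    def link_is_suspect(domain: str, times_added_recently: int) -> bool:
        """One link to a well-known site is probably fine; the same unknown
        link added to multiple pages is a classic spam pattern."""
        if domain in TRUSTED_DOMAINS:
            return False
        return times_added_recently > 1

    def should_show_captcha(accounts_from_ip_today: int,
                            edits_last_minute: int,
                            new_link_domain: Optional[str],
                            times_added_recently: int) -> bool:
        if accounts_from_ip_today > MAX_ACCOUNTS_PER_IP_PER_DAY:
            return True
        if edits_last_minute > MAX_EDITS_PER_MINUTE:
            return True
        if new_link_domain is not None:
            return link_is_suspect(new_link_domain, times_added_recently)
        return False   # generic entry points stay captcha-free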
Basically, I'd advocate a strategy of adding a few heuristics to try and
figure out who the mice are before putting the mousetraps in front of
them. Of course, the biggest rats will still break the captcha and get
through, but that is already true. Though reducing the prevalence of the
captcha may increase the volume of spam by some small measure, I think it
is more important that we stop erecting so many hurdles to new editors.
On 05/12/2014 06:28, Robert Rohde wrote:
I suspect that a lot of the spam is the obvious kind, such as external links to junk sites and repetitive promotional postings, though perhaps there are also less obvious types of spam?
I suspect we could weed out a lot of spammy link behavior by designing an
external link classifier that used knowledge of what external links are
frequently included and what external links are frequently removed to
generate automatic good / suspect / bad ratings for new external links (or
domains). Good links (e.g. NYTimes, CNN) might be automatically allowed
for all users, suspect links (e.g. unknown or rarely used domains) might be
automatically allowed for established users and challenged with captchas or
other tools for new users / IPs, and bad links (i.e. those repeatedly
spammed and removed) could be automatically detected and blocked.
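
As a sketch, such a classifier could start from nothing more than per-domain add/remove counts mined from edit history; the thresholds and tier names below are assumptions for illustration:

    def rate_domain(times_added: int, times_removed: int) -> str:
        """Rate a domain by how often links to it are added vs. removed.
        The cut-offs are illustrative assumptions."""
        if times_added < 5:
            return "suspect"              # too little history to judge
        removal_rate = times_removed / times_added
        if removal_rate > 0.8:
            return "bad"                  # repeatedly spammed and cleaned up
        if removal_rate < 0.1:
            return "good"                 # widely used, rarely reverted
        return "suspect"

    def action_for_new_link(rating: str, user_is_established: bool) -> str:
        """Map the three tiers onto the policy described above."""
        if rating == "good":
            return "allow"
        if rating == "bad":
            return "block"
        # suspect: established users pass, new users / IPs get challenged
        return "allow" if user_is_established else "captcha"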
What about applying ClueBot NG's Vandalism Detection Algorithm?
At this point I think machine learning is the only way a real CAPTCHA
can keep up with evil bots, and a text-based system (such as T34695
<https://phabricator.wikimedia.org/T34695>) would only be used for
tuning, just as reCAPTCHA does.
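
For concreteness, a minimal machine-learned edit classifier could look like the scikit-learn sketch below. To be clear, this is a generic bag-of-words pipeline, not ClueBot NG's actual algorithm (which, as I understand it, combines hand-engineered features with a neural network), and the training data is a placeholder:

    # Hypothetical sketch: a bag-of-words spam classifier for edit text.
    # This is NOT ClueBot NG's algorithm; it only illustrates learning
    # from labelled examples instead of using a fixed test.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Placeholder examples; a real system would train on labelled edits.
    edits = [
        "Added citation to a peer-reviewed journal article",
        "BUY CHEAP PILLS http://pharma.example NOW",
        "Fixed a typo in the second paragraph",
        "great site visit http://spam.example http://spam.example",
    ]
    labels = [0, 1, 0, 1]   # 0 = legitimate, 1 = spam

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression())
    model.fit(edits, labels)

    # predict_proba yields a tunable score rather than a hard yes/no,
    # which is what makes the system adjustable in the way reCAPTCHA's
    # risk analysis is.
    score = model.predict_proba(["check out http://spam.example"])[0][1]
    print(f"spam probability: {score:.2f}")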