[WikiEN-l] Captcha word quiz
Neil Harris
neil at tonal.clara.co.uk
Mon Mar 20 16:04:41 UTC 2006
Theresa Knott wrote:
> On 3/20/06, Neil Harris <neil at tonal.clara.co.uk> wrote:
>
>> Theresa Knott wrote:
>>
>>> On 3/20/06, Steve Bennett <stevage at gmail.com> wrote:
>>>
>>>
>>>> On 3/20/06, Theresa Knott <theresaknott at gmail.com> wrote:
>>>>
>>>>
>>>>> Sorry, I thought a captcha was a wiggly word _image_. What I am
>>>>> describing could easily be text.
>>>>>
>>>>>
>>>> Invent one then ;) Bear in mind that if it's multiple choice, then the
>>>> robot could just have a few goes.
>>>>
>>>>
>>> A colour that's the opposite of black
>>>
>>> the number of days in a week
>>>
>>> A pet animal that goes woof woof.
>>>
>>>
>>>
>> Unfortunately, Googling and word-counting is a rather powerful way of
>> cheating at these sorts of simple general-knowledge questions:
>>
>> Searching for "animal that goes woof woof", removing trivial words from
>> the page fragments Google returns on its search page, and then counting
>> the most common remaining words gives the following:
>>
>> 43: woof
>> 9: animal
>> 8: dog
>> 7: goes
>> 6: joke
>>
>> for "colour that's the opposite of black" we get:
>>
>> 21: colour
>> 10: black
>> 6: opposite
>> 6: white
>> 5: you
>>
>>
>> If you then try the words which are not in the question, the highest
>> rated few words tend to contain the answer to the question. Even if we
>> try at random from the top-rated words, there's a good chance of
>> success, which is all a bot needs.
>>
>>
>
>
> OK so simple quiz questions are out, and so are multiple choice
> questions. What about this:
>
> Take the second letter of ADAM, the first letter of OGRE and the last
> of RAINING.
>
>
> Is it possible for a bot to get around that?
>
>
> Theresa
>
>
On its own, probably not. If you start using thousands of questions of
the same form, though, someone will write a simple program to do them,
and it then becomes trivial.
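
To illustrate how little code such a program needs, here is a minimal
sketch in Python. It assumes every question follows the exact wording of
the example above ("the <ordinal> [letter] of <WORD>", with the word in
capitals); the function name and the list of ordinals are made up for
this illustration.

```python
import re

# Map ordinal words to zero-based string indices; "last" maps to -1.
ORDINALS = {"first": 0, "second": 1, "third": 2,
            "fourth": 3, "fifth": 4, "last": -1}

# Matches "<ordinal> [letter] of <WORD>", where WORD is written in capitals.
CLAUSE = re.compile(
    r"\b(first|second|third|fourth|fifth|last)\b(?:\s+letter)?\s+of\s+([A-Z]+)\b"
)

def solve_letter_puzzle(question: str) -> str:
    """Pick the requested letter out of each capitalised word, in order."""
    return "".join(word[ORDINALS[ordinal]]
                   for ordinal, word in CLAUSE.findall(question))

question = ("Take the second letter of ADAM, the first letter of OGRE "
            "and the last of RAINING.")
print(solve_letter_puzzle(question))  # prints DOG
```

A real generator would vary the wording, but each new template only
costs the attacker one more pattern.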
I tried to think of a good text-only captcha scheme some time ago, but
came up short.
The ideal text captcha is:
* endlessly variable (there must be at least millions of potential
challenges, to defend against replay attacks)
* easy for people to answer without any specialist knowledge
* easy to answer for people without advanced skills in the target language
* not generated by a simple algorithm which can be reverse-engineered
(as with the above)
* not Googlable
* easy to assess the answer using a computer program (which typically
means it's a simple word or phrase)
It's hard to generate large numbers of questions which are easy for
people to answer, but hard for machines. For a start, questions about
obscure topics are a test of general knowledge, not humanity, and many
people will fail them. Questions with contorted syntax will be difficult
for non-native speakers.
Questions of the form "what is the capital of X" are easily dealt with
by simple lookup, or Googling. In general, any database of facts you can
find in public to pose questions can also be found by a spammer.
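
The word-counting attack quoted earlier in this message can be sketched
in a few lines of Python. This is only an illustration under stated
assumptions: the snippets are invented stand-ins for search-result
fragments (no search engine is queried), and the stopword list and
function names are hypothetical.

```python
import re
from collections import Counter

# A tiny, hand-picked list of "trivial" words; a real attack would use more.
STOPWORDS = {"a", "an", "the", "that", "of", "is", "it", "and", "to",
             "in", "what", "my", "you"}

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def candidate_answers(question, snippets, top_n=5):
    """Rank words from the snippets that are neither trivial nor in the question."""
    question_words = set(tokenize(question))
    counts = Counter(
        word
        for snippet in snippets
        for word in tokenize(snippet)
        if word not in STOPWORDS and word not in question_words
    )
    return [word for word, _ in counts.most_common(top_n)]

# Invented snippets standing in for search-result fragments.
snippets = [
    "The dog is an animal that goes woof woof.",
    "What animal goes woof woof? A dog!",
    "My dog goes woof, the cat goes meow.",
]
print(candidate_answers("A pet animal that goes woof woof.", snippets))
```

The bot then submits the top-ranked candidates in turn until one is
accepted.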
Algorithmically-generated questions are vulnerable to reverse
engineering: questions which require the reader to perform simple symbol
manipulations or answer auto-generated logic puzzles are easily
performed by computer. Devious riddles like the Riddle of the Sphinx
will stump most readers if they do not already know the answer, and all
the common ones are Googlable anyway. Questions with ambiguous answers
are hard to mark correctly using a computer.
Even assuming a carefully compiled list of, say, 1000 suitable questions
that avoid all these pitfalls could be composed by hand (perhaps by a
group effort), a spammer would only need to build a list of them once,
and they could then be answered perfectly by simple lookup.
The nice thing about visual captchas is that the operations used to
create them are effectively one-way: for example, stirring the pixels in
the current Wikipedia captchas is easy to do, but hard to invert
programmatically, yet the human eye can still decode it. What's needed
is a similar operation for text that uses the power of the human
cognitive system in the same way that visual captchas use the power of
the human visual system.
For example, one good class of anti-bot precautions uses Javascript, and
works on the principle that most bot authors cannot be bothered to
include a Javascript interpreter in their bot, but that every modern
browser is capable of interpreting Javascript without the user needing
to do anything special.
Similarly, you can slow down spammers by imposing a computational burden:
requiring the far end to generate hash collisions, something that
can't be done without a powerful computer at the other end working away
for some time. In fact, it's probably easier to create questions that
machines can answer, but people can't.
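
One common way to build such a computational burden is a hashcash-style
puzzle: the client must find a nonce whose hash, combined with a
server-issued challenge, starts with a given number of zero bits. The
sketch below is a minimal Python illustration of that idea, not a
description of any deployed scheme; the challenge string, the choice of
SHA-256, and the function names are all assumptions.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap server-side check: a single hash computation."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty

def solve(challenge: str, difficulty: int) -> int:
    """Expensive client-side search: about 2**difficulty hashes on average."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce

# Hypothetical challenge string; 12 bits keeps the demo fast.
nonce = solve("edit-token-123", 12)
print(verify("edit-token-123", nonce, 12))  # prints True
```

Raising the difficulty by one bit doubles the expected work for the
sender, while verification stays a single hash.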
What's needed is something that exercises a uniquely human skill that
only involves understanding. Perhaps story understanding? Or reasoning
about hidden emotions or mental states, both things that people have
evolved to do very well? (Note, though, that many real people with autistic
spectrum disorder won't be able to answer these questions.)
-- Neil