[WikiEN-l] Captcha word quiz

Neil Harris neil at tonal.clara.co.uk
Mon Mar 20 16:04:41 UTC 2006


Theresa Knott wrote:
> On 3/20/06, Neil Harris <neil at tonal.clara.co.uk> wrote:
>   
>> Theresa Knott wrote:
>>     
>>> On 3/20/06, Steve Bennett <stevage at gmail.com> wrote:
>>>
>>>       
>>>> On 3/20/06, Theresa Knott <theresaknott at gmail.com> wrote:
>>>>
>>>>         
>>>>> Sorry, I thought a captcha was a wiggly word _image_. What I am
>>>>> describing could easily be text.
>>>>>
>>>>>           
>>>> Invent one then ;) Bear in mind that if it's multiple choice, then the
>>>> robot could just have a few goes.
>>>>
>>>>         
>>> A colour that's the opposite of black
>>>
>>> the number of days in a week
>>>
>>> A pet animal that goes woof woof.
>>>
>>>
>>>       
>> Unfortunately, Googling and word-counting is a rather powerful way of
>> cheating at these sorts of simple general-knowledge questions:
>>
>> Searching for "animal that goes woof woof", removing trivial words from
>> the page fragments Google returns on its search page, and then counting
>> the most common remaining words gives the following:
>>
>> 43: woof
>> 9: animal
>> 8: dog
>> 7: goes
>> 6: joke
>>
>> for "colour that's the opposite of black" we get:
>>
>> 21: colour
>> 10: black
>> 6: opposite
>> 6: white
>> 5: you
>>
>>
>> If you then try the words which are not in the question, the highest
>> rated few words tend to contain the answer to the question. Even if we
>> try at random from the top-rated words, there's a good chance of
>> success, which is all a bot needs.
>>
>>     
>
>
> OK so simple quiz questions are out, and so are multiple choice
> questions. What about this:
>
> Take the second letter of ADAM, the first letter of  OGRE and the last
> of RAINING.
>
>
> Is it possible for a bot to get around that?
>
>
> Theresa
>
>   
On its own, probably not. If you start using thousands of questions of 
the same form, though, someone will write a simple program to do them, 
and it then becomes trivial.
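
To illustrate how little work such a program would be, here is a rough 
sketch in Python that answers the example question above. The set of 
ordinal words and the exact phrasing it expects are assumptions based on 
that one example, not a description of any real bot.

```python
import re

# Ordinal words the question format is assumed to use (hypothetical list).
ORDINALS = {"first": 0, "second": 1, "third": 2, "fourth": 3, "last": -1}

# An ordinal, an optional "letter", "of", then an upper-case word.
PATTERN = re.compile(
    r"\b(first|second|third|fourth|last)\s+(?:letter\s+)?of\s+([A-Z]+)\b"
)

def solve_letter_question(question: str) -> str:
    """Answer 'Take the <ordinal> letter of WORD, ...' style questions."""
    return "".join(word[ORDINALS[ordinal]]
                   for ordinal, word in PATTERN.findall(question))

question = ("Take the second letter of ADAM, the first letter of  OGRE "
            "and the last of RAINING.")
print(solve_letter_question(question))  # prints DOG
```

A real question generator would vary the wording more than this, but each 
new template only costs the attacker another pattern.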

I tried to think of a good text-only captcha scheme some time ago, but 
came up short.

The ideal text captcha is:
* endlessly variable (there must be at least millions of potential 
challenges, to defend against replay attacks)
* easy for people to answer without any specialist knowledge
* easy to answer for people without advanced skills in the target language
* not generated by a simple algorithm which can be reverse-engineered 
(as with the above)
* not Googlable
* easy to assess the answer using a computer program (which typically 
means it's a simple word or phrase)

It's hard to generate large numbers of questions which are easy for 
people to answer, but hard for machines. For a start, questions about 
obscure topics are a test of general knowledge, not humanity, and many 
people will fail them. Questions with contorted syntax will be difficult 
for non-native speakers.

Questions of the form "what is the capital of X" are easily dealt with 
by simple lookup, or Googling. In general, any database of facts you can 
find in public to pose questions can also be found by a spammer.
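
The word-counting trick from earlier in the thread can be sketched in a 
few lines of Python. This is only an illustration: the snippets below are 
invented stand-ins for a search results page, the stop-word list is a 
guess, and fetching real results is left out entirely.

```python
import re
from collections import Counter

# A small stop-word list; an attacker would likely use a longer one (assumption).
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "that", "the",
              "this", "to", "what"}

def tokens(text):
    """Lower-case a string and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def candidate_answers(question, snippets, top=3):
    """Rank snippet words that are neither trivial nor in the question."""
    question_words = set(tokens(question))
    counts = Counter(
        word
        for snippet in snippets
        for word in tokens(snippet)
        if word not in STOP_WORDS and word not in question_words
    )
    return [word for word, _ in counts.most_common(top)]

# Invented snippets standing in for fetched search results.
snippets = [
    "The dog is the animal that goes woof woof",
    "What animal goes woof? A dog, of course",
    "Woof woof goes the dog in this old joke",
]
print(candidate_answers("A pet animal that goes woof woof", snippets))
```

The bot then tries the top few candidates in turn; it does not need to be 
right every time.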

Algorithmically-generated questions are vulnerable to reverse 
engineering: questions which require the reader to perform simple symbol 
manipulations or answer auto-generated logic puzzles are easily 
performed by computer. Devious riddles like the Riddle of the Sphinx 
will stump most readers if they do not already know the answer, and all 
the common ones are Googlable anyway. Questions with ambiguous answers 
are hard to mark correctly using a computer.

Even assuming a carefully compiled list of, say, 1000 suitable questions 
that avoid all these pitfalls could be composed by hand (perhaps by a 
group effort), a spammer would only need to build a list of them once, 
and they could then be answered perfectly by simple lookup.

The nice thing about visual captchas is that the operations used to 
create them are effectively one-way: for example, stirring the pixels in 
the current Wikipedia captchas is easy to do, but hard to invert 
programmatically, yet the human eye can still decode the result. What's needed 
is a similar operation for text that uses the power of the human 
cognitive system in the same way that visual captchas use the power of 
the human visual system.

For example, one good class of anti-bot precautions uses Javascript, and 
works on the principle that most bot authors cannot be bothered to 
include a Javascript interpreter in their bot, but that every modern 
browser is capable of interpreting Javascript without the user needing 
to do anything special.

Similarly, you can slow down spammers by imposing a computational burden: 
require the far end to generate partial hash collisions, something that 
can't be done without a powerful computer at the other end working away 
for some time. In fact, it's probably easier to create questions that 
machines can answer, but people can't.
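
A minimal sketch of the computational-burden idea, in Python. It uses a 
hashcash-style partial match (a SHA-256 digest with a given number of 
leading zero bits) rather than a full collision, which would be 
infeasible. The challenge string, the function names and the difficulty 
value are all made up for illustration.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()

def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: try nonces until one works (about 2**difficulty hashes)."""
    for nonce in count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the client's work."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty

challenge = b"edit-token-1234"  # hypothetical per-request value from the server
nonce = solve(challenge, difficulty=12)
print(verify(challenge, nonce, difficulty=12))  # prints True
```

The asymmetry is the point: the client pays for thousands of hashes per 
submission, while the server pays for one. Raising the difficulty by one 
bit roughly doubles the client's cost.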

What's needed is something that exercises a uniquely human skill that 
only involves understanding. Perhaps story understanding? Or reasoning 
about hidden emotions or mental states, both things that people have 
evolved to do very well? (Note that many real people with autistic 
spectrum disorder won't be able to answer these questions, though).

-- Neil



