Theresa Knott wrote:
On 3/20/06, Neil Harris neil@tonal.clara.co.uk wrote:
Theresa Knott wrote:
On 3/20/06, Steve Bennett stevage@gmail.com wrote:
On 3/20/06, Theresa Knott theresaknott@gmail.com wrote:
Sorry, I thought a captcha was a wiggly word _image_. What I am describing could easily be text.
Invent one then ;) Bear in mind that if it's multiple choice, then the robot could just have a few goes.
A colour that's the opposite of black
the number of days in a week
A pet animal that goes woof woof.
Unfortunately, Googling and word-counting is a rather powerful way of cheating at these sorts of simple general-knowledge questions:
Searching for "animal that goes woof woof", removing trivial words from the page fragments Google returns on its search page, and then counting the most common remaining words gives the following:
43: woof 9: animal 8: dog 7: goes 6: joke
for "colour that's the opposite of black" we get:
21: colour 10: black 6: opposite 6: white 5: you
If you then try the words which are not in the question, the highest rated few words tend to contain the answer to the question. Even if we try at random from the top-rated words, there's a good chance of success, which is all a bot needs.
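As a rough illustration of the word-counting step described above, here is a minimal Python sketch. It assumes the search-result fragments have already been fetched somehow and are passed in as plain strings; the stop-word list, the function names and the sample snippets are all made up for the example and are not what was actually used to produce the counts above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "the", "that", "of", "is", "it", "in", "to", "and", "s"}


def tokenize(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def rank_candidates(question, snippets, top_n=5):
    """Return the most common snippet words that are neither trivial
    nor already present in the question itself."""
    question_words = set(tokenize(question))
    counts = Counter(
        word
        for snippet in snippets
        for word in tokenize(snippet)
        if word not in STOP_WORDS and word not in question_words
    )
    return [word for word, _ in counts.most_common(top_n)]


# Invented snippets standing in for search-result fragments.
snippets = [
    "The dog is an animal that goes woof woof",
    "Why did the dog say woof? A joke",
    "My dog goes woof",
]
guesses = rank_candidates("A pet animal that goes woof woof", snippets)
```

A bot would then simply submit the entries of `guesses` in order until one is accepted.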
OK so simple quiz questions are out, and so are multiple choice questions. What about this:
Take the second letter of ADAM, the first letter of OGRE and the last of RAINING.
Is it possible for a bot to get around that?
Theresa
On its own, probably no. If you start using thousands of questions of the same form, though, someone will write a simple program to do them, and it then becomes trivial.
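To show how little work that "simple program" would be, here is a sketch in Python for questions phrased like Theresa's example. The regular expression and the table of ordinals are assumptions about how such questions would be worded; a question generator that varied its phrasing would need a correspondingly longer pattern, but not a fundamentally different approach.

```python
import re

# Map position words to string indices (assumed vocabulary).
ORDINALS = {"first": 0, "second": 1, "third": 2, "fourth": 3, "fifth": 4, "last": -1}

# Matches e.g. "second letter of ADAM" or "last of RAINING".
PATTERN = re.compile(
    r"\b(first|second|third|fourth|fifth|last)(?:\s+letter)?\s+of\s+([A-Za-z]+)",
    re.IGNORECASE,
)


def solve_letter_question(question):
    """Pick out the requested letter of each named word and join them."""
    return "".join(
        word[ORDINALS[position.lower()]]
        for position, word in PATTERN.findall(question)
    )


answer = solve_letter_question(
    "Take the second letter of ADAM, the first letter of OGRE "
    "and the last of RAINING."
)
```

For the example question this yields the letters D, O and G.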
I tried to think of a good text-only captcha scheme some time ago, but came up short.
The ideal text captcha is:
* endlessly variable (there must be at least millions of potential challenges, to defend against replay attacks)
* easy for people to answer without any specialist knowledge
* easy to answer for people without advanced skills in the target language
* not generated by a simple algorithm which can be reverse-engineered (as with the above)
* not Googlable
* easy to assess the answer using a computer program (which typically means it's a simple word or phrase)
It's hard to generate large numbers of questions which are easy for people to answer, but hard for machines. For a start, questions about obscure topics are a test of general knowledge, not humanity, and many people will fail them. Questions with contorted syntax will be difficult for non-native speakers.
Questions of the form "what is the capital of X" are easily dealt with by simple lookup, or Googling. In general, any database of facts you can find in public to pose questions can also be found by a spammer.
Algorithmically-generated questions are vulnerable to reverse engineering: questions which require the reader to perform simple symbol manipulations or answer auto-generated logic puzzles are easily performed by computer. Devious riddles like the Riddle of the Sphinx will stump most readers if they do not already know the answer, and all the common ones are Googlable anyway. Questions with ambiguous answers are hard to mark correctly using a computer.
Even assuming a carefully compiled list of, say, 1000 suitable questions that avoid all these pitfalls could be composed by hand (perhaps by a group effort), a spammer would only need to build a list of them once, and they could then be answered perfectly by simple lookup.
The nice thing about visual captchas is that the operations used to create them are effectively one-way: for example, stirring the pixels in the current Wikipedia captchas is easy to do, but hard to invert programmatically, yet the human eye can still decode the result. What's needed is a similar operation for text that uses the power of the human cognitive system in the same way that visual captchas use the power of the human visual system.
For example, one good class of anti-bot precautions uses Javascript, and works on the principle that most bot authors cannot be bothered to include a Javascript interpreter in their bot, but that every modern browser is capable of interpreting Javascript without the user needing to do anything special.
Similarly, you can slow down spammers by imposing a computational burden: require the far end to generate hash collisions, something that can't be done without a powerful computer at the other end working away for some time. In fact, it's probably easier to create questions that machines can answer, but people can't.
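A minimal Python sketch of the computational-burden idea follows. Note that it does not search for a true hash collision: it uses the related hashcash-style trick of finding a nonce whose SHA-256 digest starts with a given number of zero bits, which is my substitution and not necessarily the scheme meant above. The challenge string, the function names and the difficulty value are invented for the example.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Number of zero bits at the start of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(challenge, nonce, difficulty_bits):
    """Cheap check done by the server: one hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge, difficulty_bits):
    """Expensive search done by the client: on average about
    2 ** difficulty_bits hashes before a nonce is found."""
    for nonce in count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce


# Hypothetical challenge; 12 bits keeps the demo fast.
nonce = solve("example-challenge", 12)
```

The asymmetry is the point: the server checks an answer with a single hash, while each extra bit of difficulty doubles the expected work for whoever is submitting.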
What's needed is something that exercises a uniquely human skill that only involves understanding. Perhaps story understanding? Or reasoning about hidden emotions or mental states, both things that people have evolved to do very well? (Note that many real people with autistic spectrum disorder won't be able to answer these questions, though).
-- Neil