Captcha readability

Platonides

7 Oct 2007

6:27 p.m.

Captchas that are too hard to pass have been discussed here before, but without samples. Today I found one of them. I read it as "ghooktrust", but MediaWiki didn't agree. The first letter could be a 5, but we don't use numbers, so I finally realised it might be an "s".

Anyone wishing to test their humanness?

http://upload.wikimedia.org/wikipedia/test/f/f5/WpCaptchaId_734931244.gif

Simetrical

7 Oct

6:37 p.m.

Well, the captcha always consists of two words concatenated together, I do believe. "Shook" is a rather obscure word, however. Perhaps the dictionary could be made less comprehensive. Although that brings us back to non-English speakers, who won't be helped at all.

It could be either, yes, looking at it. But if you refresh it gives you a different captcha, right?

Soo Reams

6:47 p.m.

Simetrical wrote:

Well, the captcha always consists of two words concatenated together, I do believe. "Shook" is a rather obscure word, however. Perhaps the dictionary could be made less comprehensive. Although that brings us back to non-English speakers, who won't be helped at all.

At the risk of sounding foolish, isn't "shook" a rather common word and "ghook", well, not a word at all?

Soo

Thomas Dalton

6:51 p.m.

...

At the risk of sounding foolish, isn't "shook" a rather common word and "ghook", well, not a word at all?

Not foolish at all. "Shook" is the past tense of "shake", a perfectly common word. "Ghook" is not a word I'm familiar with. However, I agree that the first letter does look more like a 'g' than an 's'.

Simetrical

6:57 p.m.

On 10/7/07, Soo Reams soo@sooreams.com wrote:

...

At the risk of sounding foolish, isn't "shook" a rather common word and "ghook", well, not a word at all?

Er, right, "shook" is quite a common word. I wasn't thinking of past-tense verbs when I read it, somehow.

Jay R. Ashworth

7:07 p.m.

On Sun, Oct 07, 2007 at 11:47:30PM +0100, Soo Reams wrote:

At the risk of sounding foolish, isn't "shook" a rather common word and "ghook", well, not a word at all?

Well, "shook" is a reasonably common word *in English*.

But that's *Hebrew* around it, no?

Cheers, -- jra

-- Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://baylink.pitas.com '87 e24 St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274

Gregory Maxwell

6:50 p.m.

On 10/7/07, Simetrical Simetrical+wikilist@gmail.com wrote:

Well, the captcha always consists of two words concatenated together, I do believe. "Shook" is a rather obscure word, however. Perhaps the dictionary could be made less comprehensive. Although that brings us back to non-English speakers, who won't be helped at all.

It could be either, yes, looking at it. But if you refresh it gives you a different captcha, right?

We should change to random characters: Using dictionary words, even a 'secret' dictionary, substantially reduces the entropy of the captchas. Yes, the dictionary makes the captcha easier for humans but it's an even bigger help to computers which can fit much more accurate state transition models in their memory.
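To put rough numbers on the entropy argument, here is a minimal sketch. The dictionary size (5,000 words) and the random-string length (8 lowercase letters) are assumptions picked purely for illustration; they are not figures from the actual Wikimedia captcha.

```python
import math

def dictionary_entropy_bits(dictionary_size: int, words: int = 2) -> float:
    """Bits of entropy when each word is drawn uniformly from a dictionary."""
    return words * math.log2(dictionary_size)

def random_entropy_bits(alphabet_size: int, length: int) -> float:
    """Bits of entropy for a string of independent, uniformly random characters."""
    return length * math.log2(alphabet_size)

# Hypothetical figures: a 5,000-word dictionary vs. 8 random lowercase letters.
two_words = dictionary_entropy_bits(5000)
random_chars = random_entropy_bits(26, 8)
print(f"two dictionary words: {two_words:.1f} bits")
print(f"8 random letters:     {random_chars:.1f} bits")
```

Under these assumed numbers the two-word captcha carries about 24.6 bits and the random string about 37.6 bits, so an attacker who knows (or guesses) the dictionary has a far smaller space to search.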

The goal of the captcha should be to maximize the gap between humans and computers, the goal should not be to be maximally hard.

Right now our captcha is weak by standard wisdom: the characters are too easily segmented. A tuned copy of the Tesseract 2.0 OCR engine, without any statistical modeling, can recognize about 25% of the letters in most of the Wikimedia captchas. That's still pretty far from cracking it, but I bet someone skilled at captcha cracking wouldn't have too hard a time.

The captcha generator is a really simple Python script that is easy and fun to modify. I made a copy here that distorts the text less but packs the characters closer together and adds a wiggly connecting line, which is popular these days. The result is easier to read, making the use of mostly random characters acceptable, and it completely defeats Tesseract ... but I can't prove that it's not massively less secure against some other attack, so I haven't proposed that we use it. :(

Neil Harris

8 Oct

6 a.m.

Gregory Maxwell wrote:

The captcha generator is a really simple Python script that is easy and fun to modify. I made a copy here that distorts the text less but packs the characters closer together and adds a wiggly connecting line, which is popular these days. The result is easier to read, making the use of mostly random characters acceptable, and it completely defeats Tesseract ... but I can't prove that it's not massively less secure against some other attack, so I haven't proposed that we use it. :(

As the author of the original Python captcha script, I'd like to say that this sounds like an excellent idea.

Could you post the source, please?

The design rationale for the current version is to resist captcha-defeating segmentation-independent edge-slope OCR by randomizing the edge slopes of characters a lot, whilst distorting overall character and word shapes rather less. I'm a bit disappointed that Tesseract is doing so well on the output of the existing code.

Your technique for resisting more conventional OCR by preventing segmentation is complementary to this, and, as you say, should defeat Tesseract quite effectively. If it provides a more readable captcha without loss of security, and it's readable enough to allow for random characters rather than relying on whole-word recognition to compensate for reducing per-character readability, we should consider putting it into use right away.

If we do this, we should also keep the existing captcha source in reserve, even if it's weaker; the more variants we have on the captcha algorithm, the more defence in depth we will have against attackers, both in terms of changing algorithms quickly in production if the currently-used method is compromised, and providing a base for rapid development if all the existing variants are ever compromised at once.

It also might be worth experimenting with retaining the high-frequency character-edge disruption of the old code, whilst adopting your approach for the rest of the captcha.

-- Neil

Mohamed Magdy

7 Oct

6:51 p.m.

Simetrical wrote:

Well, the captcha always consists of two words concatenated together, I do believe. "Shook" is a rather obscure word, however. Perhaps the dictionary could be made less comprehensive. Although that brings us back to non-English speakers, who won't be helped at all.

About those :)

Is it possible to have captcha images in other languages, i.e. captcha internationalization?

--wm:alnokta

Rob Church

6:54 p.m.

On 07/10/2007, Mohamed Magdy mohamed.m.k@gmail.com wrote:

...

Is it possible to have captcha images in other languages, i.e. captcha internationalization?

It's quite possible, and wouldn't necessarily be too difficult; it just hasn't been done yet.

Rob Church

Gregory Maxwell

7 p.m.

On 10/7/07, Rob Church robchur@gmail.com wrote:

It's quite possible, and wouldn't necessarily be too difficult; it just hasn't been done yet.

We pretty much just need a secret set of non-offensive recognizable words in the language which are composed only of characters in the font that we're using (and the font could be changed easily enough).

... but I think it would be better to use random character captchas, and automatically extract acceptable characters using the top 20 characters used in article titles, perhaps with a manually adjusted confusing character exception list. There would need to be some code so you could ask for a language-foo captcha while on language-bar Wikipedia.
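A minimal sketch of that extraction step, assuming the article titles are available as a plain list of strings. The function names, the sample titles, and the exception list below are all hypothetical and only illustrate the idea.

```python
from collections import Counter
import secrets

def captcha_alphabet(titles, top_n=20, confusing=frozenset()):
    """Return the top_n most frequent letters in titles, skipping confusing ones."""
    counts = Counter(
        ch
        for title in titles
        for ch in title.lower()
        if ch.isalpha() and ch not in confusing
    )
    return [ch for ch, _ in counts.most_common(top_n)]

def random_captcha(alphabet, length=6):
    """Pick `length` characters uniformly at random from the alphabet."""
    return "".join(secrets.choice(alphabet) for _ in range(length))

# Hypothetical sample titles and exception list, for illustration only.
titles = ["Alexandria", "Cairo", "Nile", "Captcha", "Optical character recognition"]
alphabet = captcha_alphabet(titles, top_n=5, confusing={"l", "i", "o"})
print(alphabet, random_captcha(alphabet))
```

Because the alphabet is derived from the wiki's own titles, the same code would pick up whatever script a given language edition uses; the exception list stays manual, as suggested above.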

Jay R. Ashworth

7:08 p.m.

On Sun, Oct 07, 2007 at 07:00:42PM -0400, Gregory Maxwell wrote:

... but I think it would be better to use random character captchas, and automatically extract acceptable characters using the top 20 characters used in article titles,

And you do that on zh... how? :-)

Cheers, -- jra

Gregory Maxwell

7:10 p.m.

On 10/7/07, Jay R. Ashworth jra@baylink.com wrote:

...

And you do that on zh... how? :-)

So, there would be a few special cases. ;) It would still be a big step ahead from where we are now.

Mohamed Magdy

7:23 p.m.

Gregory Maxwell wrote:

...

On 10/7/07, Rob Church robchur@gmail.com wrote:

It's quite possible, and wouldn't necessarily be too difficult; it just hasn't been done yet.

That's good news.

...

We pretty much just need a secret set of non-offensive recognizable words in the language which are composed only of characters in the font that we're using (and the font could be changed easily enough).

That looks like a job to be done for numerous languages.

...

... but I think it would be better to use random character captchas, and automatically extract acceptable characters using the top 20 characters used in article titles, perhaps with a manually adjusted confusing character exception list. There would need to be some code so you could ask for a language-foo captcha while on language-bar Wikipedia.

Either way is fine by me; we are entering randomness all over the net ;)

What do you mean by "ask"? Wouldn't it simply be that if you are using interface xx, you get captcha xx, and when not logged in, you get the captcha of site xx?

--alnokta

Tim Starling

8 Oct

8:22 a.m.

Mohamed Magdy wrote:

...

About those :)

Is it possible to have captcha images in other languages, i.e. captcha internationalization?

It's possible, but it would only solve half the problem -- the other half being that captchas fundamentally suck.

There is no way to make a captcha system universally accessible. You think it's hard to make a text captcha accessible to speakers of all languages? Try making an audio alternative for blind people with complete language coverage. But that still leaves the deafblind and people with certain mental handicaps out in the cold.

Eventually computers will surpass humans in their pattern matching and speech recognition abilities, making captchas obsolete. I think we should start developing tools for that day, instead of relying on this moribund technology.

-- Tim Starling

Anthony

8:43 a.m.

On 10/8/07, Tim Starling tstarling@wikimedia.org wrote:

...

Eventually computers will surpass humans in their pattern matching and speech recognition abilities, making captchas obsolete.

Captchas aren't limited to pattern matching and speech recognition. When computers catch up to humans enough to make captchas obsolete, it's time to let them write the encyclopedia.

...

I think we should start developing tools for that day, instead of relying on this moribund technology.

Pretty much all technology is moribund. If you can come up with a technology today which will solve the unknown problems of some indefinite point in the future, by all means let us know what it is.

Simetrical

12:13 p.m.

On 10/8/07, Anthony wikimail@inbox.org wrote:

...

Captchas aren't limited to pattern matching and speech recognition. When computers catch up to humans enough to make captchas obsolete, it's time to let them write the encyclopedia.

Except the point isn't finding some place where computers are stupider than people, it's finding some place where computers are stupider than people *and other computers can tell the difference*. You could ask the visitor to have a little chat with you, and thirty seconds would tell *you* the difference; but it wouldn't tell your computer anything. Computers can write encyclopedia articles that are perfectly good and high-quality . . . as far as other computers can tell.

In the not-so-distant future, I think we're going to have to give up on captchas altogether and just rely on some basic throttling, spam blacklists, and human oversight. But we aren't there yet. At the very least, captchas add an extra barrier to spam, for now.

Christensen, Courtney

12:57 p.m.

How about useful captchas? http://www.networkworld.com/community/?q=node/15522

I don't know if it is in use anywhere yet, but you use two captcha boxes. One is a control whose word is already known (like the one we use now), and the other is an unknown word that someone is trying to digitize, for example from a scanned book.
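As a rough sketch of how such a two-box check could work: the function name and the simple vote tally below are assumptions made for illustration, not a description of how the linked system is actually implemented.

```python
def check_two_word_captcha(control_answer, control_guess, unknown_guess, votes):
    """Accept the user if the control word matches; record their reading of the unknown word.

    `votes` maps each proposed reading of the unknown word to how many
    users who passed the control word gave that reading.
    """
    if control_guess.strip().lower() != control_answer.lower():
        return False  # failed the known word: reject and ignore the other reading
    reading = unknown_guess.strip().lower()
    votes[reading] = votes.get(reading, 0) + 1
    return True

votes = {}
check_two_word_captcha("shook", "shook", "trust", votes)  # passes, counts a vote for "trust"
check_two_word_captcha("shook", "ghook", "tryst", votes)  # fails, reading is discarded
print(votes)
```

Only readings from users who passed the control word are counted, so the unknown word can be accepted once enough independent users agree on it.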

I guess that doesn't address the readability or security though.

David Gerard

1:01 p.m.

On 08/10/2007, Christensen, Courtney ChristensenC@battelle.org wrote:

...

How about useful captchas? http://www.networkworld.com/community/?q=node/15522 I don't know if it is in use anywhere yet, but you use two captcha boxes. One is a control you know what the word is (like we use now), and the other is an unknown word that is trying to be digitized from a scanned book for example. I guess that doesn't address the readability or security though.

I must admit that I hate captchas, but was actually pleased to see one of those captchas the first time I got one.

And it'd dovetail nicely with WMF's mission!

- d.

Brion Vibber

11 Oct

11:07 a.m.

David Gerard wrote:

I must admit that I hate captchas, but was actually pleased to see one of those captchas the first time I got one.

And it'd dovetail nicely with WMF's mission!

CMU refuses to open-source the software that runs that system, using vague justifications like "it would be hard to replicate the whole system, so surely no one would want the source".

That's not really the kind of software partner we prefer to work with.

-- brion vibber (brion @ wikimedia.org)

David Gerard

12:19 p.m.

On 11/10/2007, Brion Vibber brion@wikimedia.org wrote:

CMU refuses to open-source the software that runs that system, using vague justifications like "it would be hard to replicate the whole system, so surely no one would want the source". That's not really the kind of software partner we prefer to work with.

Eurgh. That's ... silly. Do we have any readers with ins at CMU who can explain to them the advantages of having it on a top 10 website in a convincing and saleable manner?

- d.

Simetrical

1:57 p.m.

On 10/11/07, David Gerard dgerard@gmail.com wrote:

Eurgh. That's ... silly. Do we have any readers with ins at CMU who can explain to them the advantages of having it on a top 10 website in a convincing and saleable manner?

It seems likely that if Wikipedia said it would run the captchas if everything were free and open-source, they would open-source it. I expect we wouldn't want to participate in any case if it required using their servers, which it seems to in the general case: would we want Wikipedia captchas to break if they have some downtime? Could we implement some kind of reliable way to substitute our own captchas should that occur? (Like what, ping their server on every captcha request before we give the response to the request? Slow *and* unreliable.)
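A minimal sketch of the substitution idea, assuming a hypothetical `fetch_remote` callable that raises on timeout or error and a hypothetical `make_local` callable that produces one of our own captchas. Neither corresponds to a real API.

```python
import random
import string

def get_captcha(fetch_remote, make_local, timeout_s=0.5):
    """Try the remote service first; on any failure, fall back to a local captcha.

    `fetch_remote` and `make_local` are hypothetical callables supplied by the caller.
    """
    try:
        return "remote", fetch_remote(timeout_s)
    except Exception:
        # Covers timeouts, connection errors and malformed responses alike.
        return "local", make_local()

def local_captcha(length=6):
    """Stand-in for our own generator: random lowercase letters."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

def remote_down(timeout_s):
    """Simulates the third-party service being unreachable."""
    raise TimeoutError("remote captcha service did not answer")

print(get_captcha(remote_down, local_captcha))
```

Note that during an outage every request still waits out the timeout before falling back, which is exactly the "slow" half of the objection; avoiding that would need some cached health state rather than a per-request check.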

Platonides

4:23 p.m.

Simetrical wrote:

It seems likely that if Wikipedia said it would run the captchas if everything were free and open-source, they would open-source it. I expect we wouldn't want to participate in any case if it required using their servers, which it seems to in the general case: would we want Wikipedia captchas to break if they have some downtime? Could we implement some kind of reliable way to substitute our own captchas should that occur? (Like what, ping their server on every captcha request before we give the response to the request? Slow *and* unreliable.)

Another good reason to open-source it. Wiki*edia could replicate the basics and communicate with their servers in batches.

Their answer is really weak. What's the problem if nobody wants it? It seems they're afraid someone will copy them...

Gregory Maxwell

2:11 p.m.

On 10/11/07, David Gerard dgerard@gmail.com wrote:

...

Eurgh. That's ... silly. Do we have any readers with ins at CMU who can explain to them the advantages of having it on a top 10 website in a convincing and saleable manner?

To be fair ... there aren't all that many activities on WP that trigger the captcha: Account creation, login attempts after repeated failures, external links by anons.

I'd be somewhat surprised if we had more than 15,000 captchas solved per day. We're a traffic monster, but most of that traffic is reads.

Simetrical Simetrical+wikilist@gmail.com wrote:

...

I expect we wouldn't want to participate in any case if it required using their servers, which it seems to in the general case: would we want Wikipedia captchas to break if they have some downtime? Could we implement some kind of reliable way to substitute our own captchas should that occur?

There is also the question of who benefits from the recaptcha work: I've not seen a lot of real information on that.

It would be interesting if they published a large database of problem word images -> recaptcha validated recognitions to help people with OCR research, if nothing else.

Anthony

8 Oct

3:11 p.m.

On 10/8/07, Simetrical Simetrical+wikilist@gmail.com wrote:

Except the point isn't finding some place where computers are stupider than people, it's finding some place where computers are stupider than people *and other computers can tell the difference*. You could ask the visitor to have a little chat with you, and thirty seconds would tell *you* the difference; but it wouldn't tell your computer anything. Computers can write encyclopedia articles that are perfectly good and high-quality . . . as far as other computers can tell.

Well, maybe I'm wrong. Encyclopedia writing isn't itself a captcha, but I find it hard to believe we're going to be in a place where computers can read anything humans can and yet don't understand language. Humans can easily deal with missing letters and even missing words by using context clues and common sense. Once computers can read and understand language the task of writing an encyclopedia seems within reach. Even if not, the task of automated vandalism-fighting will likely improve enough to make captchas less necessary.

...

In the not-so-distant future, I think we're going to have to give up on captchas altogether and just rely on some basic throttling, spam blacklists, and human oversight. But we aren't there yet. At the very least, captchas add an extra barrier to spam, for now.

I mostly agree, for some value of "the not-so-distant future". Of course, I also think in the not-so-distant future the kind of contributions made by drive-by editors will be superseded by automated tools, making it much more reasonable to have a thirty-second chat required before editing can take place.

Mohamed Magdy

9 Oct

6:57 a.m.

Is it possible to generate video on the fly containing random characters and numbers and use it instead of the still images? Or maybe generate animated GIFs (with characters dancing up and down in a colorful way :))?

I assume that would slow down OCR (forcing it to convert the video to still images first and then do the OCR), but I don't know much about video; maybe it is possible to make the video give garbage when converted to still images. If you google "video character recognition" you get a lot of hits, but I don't think spammers are capable of developing video character recognition for their needs ;)

-- user:alnokta

Gisle Sælensminde

7:58 a.m.

That would most likely be a compatibility nightmare (except maybe for animated GIFs), while not solving any of the aforementioned accessibility problems.

I suspect that video captchas can in fact be worse, at least if you don't consider decoding time, since you give the adversary more information. If an attacker takes all the still images and tries to decode the captcha from each of them, they may be able to decode some of the stills, and even interframe information can be used. With a still image there is only one chance.
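A back-of-the-envelope sketch of that effect, under the simplifying assumption that each frame is an independent attempt with the same success probability. Real video frames are correlated, so the 5% per-frame figure and the independence are both illustrative assumptions.

```python
def chance_any_frame_decoded(p_single: float, frames: int) -> float:
    """Probability that at least one of `frames` independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** frames

# Hypothetical per-frame OCR success rate of 5%.
for n in (1, 10, 30):
    print(n, round(chance_any_frame_decoded(0.05, n), 3))
```

With these assumed numbers, a 5% chance on one still grows to roughly 40% over 10 frames and about 79% over 30 frames.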

One could of course think of an animation that makes a word "emerge from the noise", utilising the human ability to merge a set of still pictures, each too noisy to interpret on its own, into an image, but I would suggest staying with still images.
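The "emerge from the noise" idea can be illustrated with a toy model: a tiny binary pattern, frames in which each pixel is flipped at random, and a per-pixel majority vote standing in for what the eye does. The 5x5 pattern, the 30% flip rate, and the 201 frames are all assumptions chosen for the demonstration.

```python
import random

def noisy_frame(pattern, flip_prob, rng):
    """Return a copy of the binary pattern with each pixel flipped with probability flip_prob."""
    return [[px ^ (rng.random() < flip_prob) for px in row] for row in pattern]

def majority_merge(frames):
    """Merge frames pixel by pixel: a pixel is 1 if it is 1 in most frames."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [int(2 * sum(f[r][c] for f in frames) > n) for c in range(cols)]
        for r in range(rows)
    ]

# A tiny 5x5 letter "T" as the hidden pattern (illustrative only).
pattern = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
rng = random.Random(0)
frames = [noisy_frame(pattern, 0.3, rng) for _ in range(201)]
merged = majority_merge(frames)
wrong = sum(a != b for ra, rb in zip(pattern, merged) for a, b in zip(ra, rb))
print("pixels wrong in the merged image:", wrong)
```

The same merging is of course available to an attacker who captures the frames, which is in line with the concern about giving the adversary more information.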

Jay R. Ashworth

10 Oct

3:26 p.m.

On Mon, Oct 08, 2007 at 08:43:02AM -0400, Anthony wrote:

...

Pretty much all technology is moribund. If you can come up with a technology today which will solve the unknown problems of some indefinite point in the future, by all means let us know what they are.

Computers!

Cheers, -- jra

wikitech-l@lists.wikimedia.org

27 comments

15 participants

participants (15)

Anthony
Brion Vibber
Christensen, Courtney
David Gerard
Gisle Sælensminde
Gregory Maxwell
Jay R. Ashworth
Mohamed Magdy
Neil Harris
Platonides
Rob Church
Simetrical
Soo Reams
Thomas Dalton
Tim Starling