Gregory Maxwell wrote:
On 10/7/07, Simetrical Simetrical+wikilist@gmail.com wrote:
On 10/7/07, Platonides Platonides@gmail.com wrote:
Captchas that are too hard to pass have been discussed here before, but without samples. Today I found one of these captchas. I read "ghooktrust", but MediaWiki didn't agree. The first letter could be a 5, but we don't use numbers. So I finally noticed it might be an "s".
Well, the captcha always consists of two words concatenated together, I do believe. "Shook" is a rather obscure word, however. Perhaps the dictionary could be made less comprehensive. Although that brings us back to non-English speakers, who won't be helped at all.
It could be either, yes, looking at it. But if you refresh it gives you a different captcha, right?
We should change to random characters: using dictionary words, even a 'secret' dictionary, substantially reduces the entropy of the captchas. Yes, the dictionary makes the captcha easier for humans, but it's an even bigger help to computers, which can fit much more accurate state transition models in their memory.
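As a rough illustration of the entropy point, here is a back-of-the-envelope sketch. The dictionary size (4096 words) and the random-string length (10 lowercase letters, the same length as "shooktrust") are assumptions picked for the arithmetic, not figures from the real captcha setup.

```python
import math

def dictionary_entropy_bits(dictionary_size, words_per_captcha=2):
    """Bits of entropy when each word is drawn uniformly from a dictionary."""
    return words_per_captcha * math.log2(dictionary_size)

def random_entropy_bits(alphabet_size, length):
    """Bits of entropy when each character is drawn uniformly at random."""
    return length * math.log2(alphabet_size)

# Hypothetical numbers: a 4096-word dictionary vs. 10 random lowercase letters.
print(dictionary_entropy_bits(4096))             # 24.0
print(round(random_entropy_bits(26, 10), 1))     # 47.0
```

Under those assumptions the two-word captcha carries about half the bits of a random string of the same length, and an attacker who knows or can guess the dictionary gets the benefit of that.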
The goal of the captcha should be to maximize the gap between humans and computers; the goal should not be to be maximally hard.
Right now our captcha is weak by standard wisdom: the characters are too easily segmented. A tuned copy of the Tesseract 2.0 OCR without any statistical modeling can recognize about 25% of the letters in most of the Wikimedia captchas. That's still pretty far from cracking it, but I bet someone skilled at captcha cracking wouldn't have too hard a time.
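To put a rough number on "pretty far from cracking it": a minimal sketch, assuming each character is recognized independently (real OCR errors are correlated, so this is only a ballpark) and assuming a 10-character captcha such as "shooktrust". The function name is made up for illustration.

```python
def whole_captcha_success(per_char_accuracy, length):
    """Chance of reading every character correctly, assuming independent errors."""
    return per_char_accuracy ** length

# 25% per-character accuracy over an assumed 10-character captcha.
p = whole_captcha_success(0.25, 10)
print(f"{p:.2e}")  # 9.54e-07, i.e. roughly one success per million attempts
```

Because the exponent is the captcha length, modest gains in per-character accuracy (from a dictionary model, say) raise the whole-captcha success rate very quickly.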
The captcha generator is a really simple Python script that is easy and fun to modify. I made a copy here that distorts the text less but packs the characters closer together and adds a wiggly connecting line, which is popular these days. The result is easier to read, making the use of mostly random characters acceptable, and it completely defeats Tesseract ... but I can't prove that it's not massively less secure against some other attack, so I haven't proposed that we use it. :(
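For readers who want a feel for the two ingredients described above, here is a standalone sketch. It is not the modified script (which is not reproduced in this message); the function names, the excluded letters, and the amplitude and wavelength ranges are all hypothetical choices. It only produces the random text and the points of a sinusoidal line; drawing them onto an image is left to whatever imaging library the generator uses.

```python
import math
import random
import string

def random_captcha_text(length=8, rng=None):
    """Pick random lowercase letters, skipping a few look-alikes (assumed list)."""
    rng = rng or random.SystemRandom()
    alphabet = [c for c in string.ascii_lowercase if c not in "ilo"]
    return "".join(rng.choice(alphabet) for _ in range(length))

def wiggly_line(width, height, rng=None, step=4):
    """Return (x, y) points of a sine wave running across the image."""
    rng = rng or random.SystemRandom()
    amplitude = rng.uniform(0.10, 0.25) * height   # stays well inside the image
    wavelength = rng.uniform(0.3, 0.6) * width
    phase = rng.uniform(0, 2 * math.pi)
    mid = height / 2
    return [
        (x, mid + amplitude * math.sin(2 * math.pi * x / wavelength + phase))
        for x in range(0, width + 1, step)
    ]

text = random_captcha_text()
points = wiggly_line(200, 60)
print(text, len(points))
```

The line passes through the middle of the text band so that it touches most characters, which is the property that makes segmentation harder.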
As the author of the original Python captcha script, I'd like to say that this sounds like an excellent idea.
Could you post the source, please?
The design rationale for the current version is to resist captcha-defeating segmentation-independent edge-slope OCR by randomizing the edge slopes of characters a lot, whilst distorting overall character and word shapes rather less. I'm a bit disappointed that Tesseract is doing so well on the output of the existing code.
Your technique for resisting more conventional OCR by preventing segmentation is complementary to this, and, as you say, should defeat Tesseract quite effectively. If it provides a more readable captcha without loss of security, and it's readable enough to allow for random characters rather than relying on whole-word recognition to compensate for reducing per-character readability, we should consider putting it into use right away.
If we do this, we should also keep the existing captcha source in reserve, even if it's weaker; the more variants we have on the captcha algorithm, the more defence in depth we will have against attackers, both in terms of changing algorithms quickly in production if the currently-used method is compromised, and providing a base for rapid development if all the existing variants are ever compromised at once.
It also might be worth experimenting with retaining the high-frequency character-edge disruption of the old code, whilst adopting your approach for the rest of the captcha.
-- Neil