Re: [Wikitech-l] Captcha readibility

8 Oct 2007


Gregory Maxwell wrote:
...
On 10/7/07, Simetrical Simetrical+wikilist@gmail.com wrote:
...
On 10/7/07, Platonides Platonides@gmail.com wrote:
...
Captchas that are too hard to pass have been discussed here before,
but without samples.
Today I found one of these captchas. I read ghooktrust but MediaWiki
didn't agree. The first letter could be a 5, but we don't use numbers.
So I finally noticed it might be an s.
Well, the captcha always consists of two words concatenated together,
I do believe.  "Shook" is a rather obscure word, however.  Perhaps the
dictionary could be made less comprehensive.  Although that brings us
back to non-English speakers, who won't be helped at all.
It could be either, yes, looking at it.  But if you refresh it gives
you a different captcha, right?
We should change to random characters: Using dictionary words, even a
'secret' dictionary, substantially reduces the entropy of the
captchas.  Yes, the dictionary makes the captcha easier for humans but
it's an even bigger help to computers which can fit much more accurate
state transition models in their memory.
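To put rough numbers on the entropy point, here is a minimal sketch. It assumes a hypothetical 5,000-word dictionary (the real word list size is not stated in this thread), two uniformly chosen words per captcha, and, for comparison, 8 uniformly random lowercase letters. The function names are illustrative and are not taken from the captcha script.

```python
import math


def dictionary_entropy_bits(dictionary_size: int, words_per_captcha: int = 2) -> float:
    """Bits of entropy when each word is drawn uniformly from the dictionary."""
    return words_per_captcha * math.log2(dictionary_size)


def random_entropy_bits(alphabet_size: int, length: int) -> float:
    """Bits of entropy when each character is drawn uniformly from the alphabet."""
    return length * math.log2(alphabet_size)


if __name__ == "__main__":
    # ASSUMPTION: 5,000 words is a made-up figure for illustration only.
    words = dictionary_entropy_bits(5000, 2)
    # 8 lowercase letters, roughly the length of two short words.
    chars = random_entropy_bits(26, 8)
    print(f"two dictionary words: {words:.1f} bits")
    print(f"8 random letters:     {chars:.1f} bits")
```

Under these assumptions the random string carries about 13 more bits than the word pair. This only measures the size of the answer space; it says nothing about how hard either image is to segment or recognize.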
The goal of the captcha should be to maximize the gap between humans
and computers; the goal should not be to be maximally hard.
Right now our captcha is weak by standard wisdom: the characters are
too easily segmented.  A tuned copy of the Tesseract 2.0 OCR without
any statistical modeling can recognize about 25% of the letters in most
of the Wikimedia captchas. That's still pretty far from cracking it,
but I bet someone skilled at captcha cracking wouldn't have too hard a
time.
The captcha generator is a really simple Python script that is easy
and fun to modify.  I made a copy here that distorts the text less but
packs the characters closer together and adds a wiggly connecting line
which is popular these days.  The result is easier to read, making the
use of mostly random characters acceptable and it completely defeats
Tesseract ... but I can't prove that it's not massively less secure
against some other attack so I haven't proposed that we use it. :(
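The modified script itself is not included in this message, so the following is only a geometric sketch of the two ideas described above: overlapping glyph boxes so characters are harder to segment, and a sine-shaped connecting line across the full width. Every name and default value here (layout_captcha, glyph_width, overlap, amplitude, wavelength) is hypothetical, and no image is actually rendered.

```python
import math
import random


def layout_captcha(text, glyph_width=20, overlap=6,
                   amplitude=4.0, wavelength=40.0, seed=None):
    """Return glyph x-offsets and the points of a wiggly connecting line.

    ASSUMPTION: fixed-width glyph boxes and a single sine wave; the real
    script may space characters and draw its line quite differently.
    """
    rng = random.Random(seed)
    # Each glyph starts before the previous one ends, so neighbouring
    # boxes share `overlap` pixels and cannot be split on blank columns.
    step = glyph_width - overlap
    xs = [i * step for i in range(len(text))]
    total_width = xs[-1] + glyph_width if text else 0
    # A random phase keeps the line from looking identical in every image.
    phase = rng.uniform(0.0, 2.0 * math.pi)
    line = [
        (x, amplitude * math.sin(2.0 * math.pi * x / wavelength + phase))
        for x in range(total_width)
    ]
    return xs, line


if __name__ == "__main__":
    offsets, line = layout_captcha("abcd", seed=1)
    print(offsets)    # x position of each glyph box
    print(len(line))  # one line point per pixel column
```

A renderer would draw each glyph at its offset and then stroke the line through the glyph baseline, so that the line bridges every pair of adjacent characters.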
As the author of the original Python captcha script, I'd like to say 
that this sounds like an excellent idea.
Could you post the source, please?
The design rationale for the current version is to resist 
captcha-defeating segmentation-independent edge-slope OCR by randomizing 
the edge slopes of characters a lot, whilst distorting overall character 
and word shapes rather less. I'm a bit disappointed that Tesseract is 
doing so well on the output of the existing code.
Your technique for resisting more conventional OCR by preventing 
segmentation is complementary to this, and, as you say, should defeat 
Tesseract quite effectively. If it provides a more readable captcha 
without loss of security, and it's readable enough to allow for random 
characters rather than relying on whole-word recognition to compensate 
for reducing per-character readability, we should consider putting it 
into use right away.
(If we do this, we should also keep the existing captcha source in 
reserve, even if it's weaker; the more variants we have on the captcha 
algorithm, the more defence in depth we will have against attackers, 
both in terms of changing algorithms quickly in production if the 
currently-used method is compromised, and providing a base for rapid 
development if all the existing variants are ever compromised at once.)
It also might be worth experimenting with retaining the high-frequency 
character-edge disruption of the old code, whilst adopting your approach 
for the rest of the captcha.
-- Neil
