[Wikitech-l] Wikipedia CAPTCHA repair

3 Nov 2011


      Now that the Wikipedia CAPTCHA has been comprehensively broken by 
Bursztein et al. in their paper "Text-based CAPTCHA Strengths and 
Weaknesses":
http://elie.im/publication/text-based-captcha-strengths-and-weaknesses
it's time to fix the current CAPTCHA system while there is still no 
evidence that it is being automatically exploited on a large scale.
Accordingly, I've reworked the 2005-era CAPTCHA-image-generating Python 
script in the CAPTCHA engine in a way that I hope should be a drop-in 
replacement for the existing script.
Following the recommendations of the paper's authors, I've made several 
improvements, each of which is relatively weak, but which all put 
together I hope should present a defence in depth against the techniques 
described in the paper.
I've also reduced the strength of some of the existing features of the 
current CAPTCHA that the paper identifies as not being sufficiently 
effective against modern attacks.
For example, the paper shows that noise-based blurring / fragmenting of 
individual characters is not an effective measure against modern shape 
classifiers. This was a major feature of the previous code, which used 
it to try to confuse edge-slope based recognizers, which were one of 
the most promising attacks being developed at the time it was written. 
Now that this has been shown to be less useful than I had thought, 
I've backed off quite a bit on this -- without removing it 
completely -- while trying to strengthen other features of the CAPTCHA.
Similarly, the paper identifies geometric distortion alone as being a 
relatively weak technique, unless combined with effective geometric 
confusion and anti-segmentation measures.
So I've added the following:
* Negative kerning of the characters to join them together at the edges, 
making segmentation more difficult
* The addition of a long, randomly placed, near-horizontal, shallowly 
curved confusion line to make the job of segmenters and shape 
recognizers just that bit more difficult. This line is added in the 
middle of the image stirring process, so that it is not aligned with 
either the text or the output raster, and should thus not help with 
either undoing the distortion or recovering the text baseline, while 
still breaking up the topology and geometry of the text.
* More stages of more subtle image stirring, solely intended to provide 
sufficient extra geometric distortion and character outline disruption 
to make defeating the confusion created by the two measures above more 
difficult.
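
As a rough illustration of how the three measures above fit together, 
here is a minimal standalone sketch. It is not the actual script: it 
works on lists of (x, y) points rather than on a raster image, and every 
function name and numeric range in it (place_glyphs, confusion_line, 
stir, sketch_pipeline, the overlap, tilt and bow values) is a 
hypothetical choice made only for this example. What it is meant to 
show is the ordering: overlap the glyphs, stir a little, add the line, 
then stir again so the line ends up distorted along with the text.

```python
import math
import random


def place_glyphs(widths, overlap):
    """Return left x-offsets for glyphs laid out with negative kerning.

    Each glyph starts `overlap` pixels before the previous one ends,
    so neighbouring glyphs touch or overlap at their edges.
    """
    xs, x = [], 0
    for w in widths:
        xs.append(x)
        x += w - overlap
    return xs


def confusion_line(width, height, rng, steps=64):
    """Sample points along a long, near-horizontal, shallowly curved line.

    The line is a gently tilted quadratic arc spanning the image width.
    All numeric ranges are illustrative guesses, not tuned values.
    """
    y0 = rng.uniform(0.3, 0.7) * height      # random vertical placement
    tilt = rng.uniform(-0.05, 0.05) * width  # small end-to-end rise
    bow = rng.uniform(-0.1, 0.1) * height    # shallow curvature
    pts = []
    for i in range(steps + 1):
        t = i / steps
        y = y0 + tilt * (t - 0.5) + bow * 4 * t * (1 - t)
        pts.append((t * width, y))
    return pts


def stir(points, cx, cy, radius, angle):
    """Swirl points around (cx, cy); the rotation fades to zero at `radius`."""
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        d = math.hypot(dx, dy)
        if d >= radius:
            out.append((x, y))  # outside the swirl: left untouched
            continue
        a = angle * (1 - d / radius) ** 2
        ca, sa = math.cos(a), math.sin(a)
        out.append((cx + dx * ca - dy * sa, cy + dx * sa + dy * ca))
    return out


def sketch_pipeline(text_points, width, height, seed=0, stages=4):
    """Stir the text, add the confusion line mid-way, then stir again."""
    rng = random.Random(seed)

    def one_stir(pts):
        return stir(pts,
                    rng.uniform(0, width), rng.uniform(0, height),
                    rng.uniform(0.3, 0.6) * width,
                    rng.uniform(-0.4, 0.4))

    pts = list(text_points)
    for _ in range(stages // 2):
        pts = one_stir(pts)
    # The line goes in here, so it matches neither the original text
    # baseline nor the final output raster.
    pts += confusion_line(width, height, rng)
    for _ in range(stages - stages // 2):
        pts = one_stir(pts)
    return pts
```

For instance, place_glyphs([10, 10, 10], 3) gives offsets [0, 7, 14], 
so each glyph overlaps its neighbour by 3 pixels. In a raster version 
the same ordering would apply, with the stir step resampling pixels 
instead of moving points.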
To counteract the reduction in human readability, I've upped the default 
font size a bit, and I now suggest a serif font such as Droid Serif 
instead of the previous sans-serif font.
As ever, there are a vast number of twiddly arbitrary parameters in the 
code, which I have determined entirely by trial and error, in an attempt 
to balance the anti-machine-recognition measures with one another, while 
maintaining reasonable levels of human readability. There is plenty of 
scope for adjusting these.
The results so far look to me like they are more likely to resist the 
attacks described in the paper than the output of the existing code, but 
I'd be interested in getting some more eyes on the problem.
Would anyone be interested in taking a look at the code and some sample 
output?
-- Neil
