Now that the Wikipedia CAPTCHA has been comprehensively broken by Bursztein et al. in their paper "Text-based CAPTCHA Strengths and Weaknesses":
http://elie.im/publication/text-based-captcha-strengths-and-weaknesses
it's time to fix the current CAPTCHA system while there is still no evidence that it is being automatically exploited on a large scale.
Accordingly, I've reworked the 2005-era CAPTCHA-image-generating Python script in the CAPTCHA engine in a way that I hope should be a drop-in replacement for the existing script.
Following the recommendations of the paper's authors, I've made several improvements, each of which is relatively weak, but which all put together I hope should present a defence in depth against the techniques described in the paper.
I've also reduced the strength of some of the existing features of the current CAPTCHA that the paper identifies as not being sufficiently effective against modern attacks.
For example, the paper shows that noise-based blurring / fragmenting of individual characters is not an effective measure against modern shape classifiers. This was a major feature of the previous code, which used it to try to confuse edge-slope based recognizers, one of the most promising attacks being developed at the time it was written. Now that this has been shown to be less useful than I had thought, I've backed off quite a bit on this -- without removing it completely -- while trying to strengthen other features of the CAPTCHA.
Similarly, the paper identifies geometric distortion alone as being a relatively weak technique, unless combined with effective geometric confusion and anti-segmentation measures.
So I've added the following:
* Negative kerning of the characters to join them together at the edges, making segmentation more difficult
* The addition of a randomly-placed near-horizontal long and shallowly curved confusion line to make the job of segmenters and shape recognizers just that bit more difficult. This line is added in the middle of the image stirring process, so that it is not aligned either with the text or the output raster, and should thus not help either undoing the distortion or recovering the text baseline, while still breaking up the topology and geometry of the text.
* More stages of more subtle image stirring, solely intended to provide sufficient extra geometric distortion and character outline disruption to make defeating the confusion created by these two more difficult.
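As a rough illustration of the first two items (this is not the actual script; the function names, parameters and numeric ranges below are hypothetical and chosen only for the example), the geometry can be sketched in plain Python. Negative kerning advances each glyph by less than its own width, and the confusion line is a slightly tilted straight line with a single shallow arch added:

```python
import math
import random


def kerned_positions(glyph_widths, overlap):
    """Left x-coordinate of each glyph when neighbours overlap by
    `overlap` pixels (negative kerning)."""
    positions = []
    x = 0
    for width in glyph_widths:
        positions.append(x)
        x += width - overlap  # advance by less than the full glyph width
    return positions


def confusion_line(width, height, rng, max_slope=0.1, max_bow=0.08, steps=32):
    """Points along a near-horizontal, shallowly curved line that spans
    the full image width. All ranges here are illustrative guesses."""
    y0 = rng.uniform(0.3, 0.7) * height        # random vertical placement
    slope = rng.uniform(-max_slope, max_slope)  # slight tilt
    bow = rng.uniform(-max_bow, max_bow) * height  # depth of the arch
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = t * width
        # Tilted straight line plus one shallow sine arch.
        y = y0 + slope * (x - width / 2) + bow * math.sin(math.pi * t)
        points.append((x, y))
    return points


positions = kerned_positions([20, 20, 20], overlap=4)
line = confusion_line(200, 60, random.Random(1))
```

In a real generator the glyphs would be pasted at `positions` and the points in `line` drawn as a polyline with whatever imaging library is in use. Drawing the line part-way through the stirring stages, as described above, is not shown here.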
To counteract the reduction in human readability, I've upped the default font size a bit, and I now suggest a serif font such as Droid Serif instead of the previous sans-serif font.
As ever, there are a vast number of twiddly arbitrary parameters in the code, which I have determined entirely by trial and error, in an attempt to balance the anti-machine-recognition measures with one another, while maintaining reasonable levels of human readability. There is plenty of scope for adjusting these.
The new results look to me like they are more likely to be resistant to the attacks described in the paper than the output of the existing code, but I'd be interested in getting some more eyes on the problem.
Would anyone be interested in taking a look at the code and some sample output?
-- Neil