[Cross-posted to Wikitech-l]
Dear all,
I hope your holiday season is going well. I'm writing to let you know that I have released on GitHub a first (0.1) version of wikicaptcha[1], a reCAPTCHA-like program for Wiki*. The idea was born from an initial observation by Alex Brollo[2], which we have discussed at length both on the WMI mailing list and on Wikisource-l[3].
For now the code is a rewrite of Alex's scripts and nothing more. Here is what the program currently does: 1) gets a djvu, 2) extracts the text layer, 3) identifies unrecognized words and 4) produces a tiff image of them. I would like to write a "proof of concept" of the whole process, from getting OCR-ed djvu files from Commons to producing challenges, serving them, collecting answers and then using those answers in some useful way. That said, as you can see, there is still a long way to go.
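As a rough illustration of steps 2 to 4, here is a minimal sketch in Python. It is not the code in the repository. It assumes the text layer has been dumped with DjVuLibre's `djvutxt --detail=word`, that a word counts as "unrecognized" when it is missing from a known vocabulary (an assumption made here only for illustration), and that the crop is produced with `ddjvu -segment`, whose geometry is taken to share the text layer's coordinates at the default rendering scale. The function names are hypothetical.

```python
import re

# Matches one word entry of a `djvutxt --detail=word` dump, e.g.
#   (word 100 200 180 250 "Hello")
WORD_RE = re.compile(
    r'\(word\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"((?:[^"\\]|\\.)*)"\)'
)


def parse_words(sexpr):
    """Return (text, xmin, ymin, xmax, ymax) tuples found in the dump."""
    words = []
    for m in WORD_RE.finditer(sexpr):
        xmin, ymin, xmax, ymax = (int(g) for g in m.groups()[:4])
        words.append((m.group(5), xmin, ymin, xmax, ymax))
    return words


def unrecognized(words, vocabulary):
    """Keep the words whose lowercase, punctuation-free form is unknown.

    Assumption: "unrecognized" means "not in `vocabulary`".
    """
    result = []
    for word in words:
        key = re.sub(r"[^\w]", "", word[0]).lower()
        if key and key not in vocabulary:
            result.append(word)
    return result


def crop_command(djvu_path, page, word, out_path):
    """Build (but do not run) a ddjvu command that renders one word as TIFF."""
    _, xmin, ymin, xmax, ymax = word
    geometry = "%dx%d+%d+%d" % (xmax - xmin, ymax - ymin, xmin, ymin)
    return ["ddjvu", "-format=tiff", "-page=%d" % page,
            "-segment=" + geometry, djvu_path, out_path]
```

The returned list could be passed to `subprocess.run` once DjVuLibre is installed. The sketch only builds the command so that the parsing and filtering logic can be tried without any external tool.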
Obviously, in the long run we could use this system as a backup for the current one, which has shown some limitations[4]. I'm sure there are many aspects of the problem that go beyond my knowledge (I'm a physicist, not a computer scientist, you know), so any help and/or advice is welcome.
Thanks for your time.
Cristian
[1] https://github.com/CristianCantoro/wikicaptcha
[2] http://it.wikisource.org/wiki/Utente:Alex_brollo
[3] http://lists.wikimedia.org/pipermail/wikisource-l/2011-February/000939.html
[4] http://lists.wikimedia.org/pipermail/wikitech-l/2011-November/056078.html
wikisource-l@lists.wikimedia.org