wikicaptcha on GitHub - Wikisource-l

5 Jan 2012


[Cross-posted to Wikitech-l]
Dear all,
I hope your holiday season is going well.
I'm writing to let you know that I have released on GitHub a first (0.1)
version of wikicaptcha[1], a reCAPTCHA-like program for Wiki*. It grew
out of an initial observation by Alex brollo[2], which we have discussed
at length both on the WMI mailing list and on Wikisource-l[3].
For now the code is just a rewrite of Alex's scripts and nothing more.
Here is what the program currently does: 1) gets a DjVu file, 2) extracts
the text layer, 3) identifies unrecognized words and 4) produces a TIFF
image of them.
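
As a rough illustration of those four steps (a sketch only, not the
actual wikicaptcha code), the Python below parses the text layer in the
form printed by DjVuLibre's `djvused ... -e 'print-txt'`, picks out
candidate words, and builds a `ddjvu` command to render each one as a
TIFF. The function names, the sample data and above all the criterion
for "unrecognized" (a word missing from a known-word list) are
assumptions made for the example.

```python
import re

# One word entry of the s-expression printed by
#   djvused FILE.djvu -e 'select N; print-txt'
# has the form: (word xmin ymin xmax ymax "text")
WORD_RE = re.compile(
    r'\(word\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"((?:[^"\\]|\\.)*)"'
)


def parse_words(sexpr):
    """Return (text, (xmin, ymin, xmax, ymax)) for each word in the text layer."""
    words = []
    for m in WORD_RE.finditer(sexpr):
        x1, y1, x2, y2 = (int(g) for g in m.groups()[:4])
        words.append((m.group(5), (x1, y1, x2, y2)))
    return words


def unrecognized(words, known_words):
    """Keep words not found in known_words.

    ASSUMPTION: the dictionary lookup is a stand-in criterion; the
    original scripts may decide this differently.
    """
    result = []
    for text, box in words:
        key = text.strip(".,;:!?\"'()").lower()
        if key and key not in known_words:
            result.append((text, box))
    return result


def ddjvu_command(djvu_path, page, box, out_path):
    """Build a ddjvu argument list that renders one word box as TIFF.

    The segment is WxH+X+Y; check the ddjvu man page for the coordinate
    origin and scaling before relying on the crop.
    """
    x1, y1, x2, y2 = box
    segment = "%dx%d+%d+%d" % (x2 - x1, y2 - y1, x1, y1)
    return ["ddjvu", "-format=tiff", "-page=%d" % page,
            "-segment=" + segment, djvu_path, out_path]
```

Passing the returned list to `subprocess.run(cmd, check=True)` would
produce the image, provided DjVuLibre is installed.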
But I would like to write a "proof of concept" of the whole process,
from getting OCR-ed DjVu files from Commons to producing challenges,
serving them, collecting answers and then using those answers in some
useful way. That said, as you can see, there is still a long way to go.
Obviously in the long run we could use this system as a backup for the
current one, which has shown some limitations[4], but I'm sure
there are many aspects of the problem that go beyond my knowledge
(I'm a physicist, not a computer scientist, you know), so any help
and/or advice is welcome.
Thanks for your time.
Cristian
[1] https://github.com/CristianCantoro/wikicaptcha
[2] http://it.wikisource.org/wiki/Utente:Alex_brollo
[3] http://lists.wikimedia.org/pipermail/wikisource-l/2011-February/000939.html
[4] http://lists.wikimedia.org/pipermail/wikitech-l/2011-November/056078.html