[Wikitech-l] Fwd: wikicaptcha on GitHub

25 Jan 2012


We sort of use IA's data already, because many Wikisource texts are
OCR'ed on IA. If we manage to use OCR improvements within DjVu, it
shouldn't be too difficult to reupload such DjVu files to their items,
and then they could do what they want with them.
...
OCRs generally work by finding lines of text on a page, splitting the
lines into letters, then recognizing each letter separately. So an OCR
would know, for each letter of the recognized text, what its bounding
box on the page is.

However, to my knowledge there is not a single OCR that exports this
data, nor is there a standard format for it. If an open source OCR
could be modified to do this, then it would be easy to inject data
retrieved from captchas back into OCR-ed text. And it could be used
for so much more :)

I don't understand: what data are you talking about? DjVu is an open
format and can store character mappings, which is what the wikicaptcha
proof of concept is based on. There's also hOCR
(https://en.wikipedia.org/wiki/HOCR), and IA uses some proprietary ABBYY
XML format which AFAIK can somehow be read and converted to hOCR.
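
To make the hOCR point concrete, here is a minimal sketch of reading word
bounding boxes from hOCR and writing a correction back, which is roughly
what "injecting captcha data" would amount to. The sample markup, the
function names and the exact-match lookup by bounding box are assumptions
made for illustration; this is not how wikicaptcha itself works. It only
relies on the `ocrx_word` class and the `bbox x0 y0 x1 y1` property in the
`title` attribute, both of which hOCR defines.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment for illustration; real files are full XHTML pages.
SAMPLE_HOCR = """<div class="ocr_page" title="bbox 0 0 600 800">
  <span class="ocr_line" title="bbox 10 10 300 40">
    <span class="ocrx_word" title="bbox 10 10 90 40; x_wconf 91">The</span>
    <span class="ocrx_word" title="bbox 100 10 220 40; x_wconf 42">qnick</span>
    <span class="ocrx_word" title="bbox 230 10 300 40; x_wconf 88">fox</span>
  </span>
</div>"""

BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(title or "")
    return tuple(int(g) for g in match.groups()) if match else None


def word_boxes(root):
    """Return a list of (bbox, element) for every ocrx_word element."""
    found = []
    # Match on the class attribute only, so XHTML namespaces on tags do not matter.
    for el in root.iter():
        if "ocrx_word" in el.get("class", "").split():
            bbox = parse_bbox(el.get("title"))
            if bbox is not None:
                found.append((bbox, el))
    return found


def apply_correction(root, bbox, new_text):
    """Replace the text of the word with exactly this bbox; True if found."""
    for box, el in word_boxes(root):
        if box == bbox:
            el.text = new_text
            return True
    return False


root = ET.fromstring(SAMPLE_HOCR)
# Pretend a captcha answer told us the word at this box reads "quick".
apply_correction(root, (100, 10, 220, 40), "quick")
corrected_words = [el.text for _, el in word_boxes(root)]
print(corrected_words)
```

Matching on the exact bounding box keeps the sketch short; a real tool
would probably need a tolerance or an overlap test, since boxes from
different OCR runs rarely agree to the pixel.
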
The real problem is character training data, which could be used for
subsequent OCR runs. I doubt we can do much here, because everyone uses
ABBYY, and even Tesseract users don't seem to share such data in any way.
Nemo
