Re: [Wikitech-l] Fwd: wikicaptcha on GitHub

19 Jan 2012

On 19 January 2012 11:19, Cristian Consonni <kikkocristian(a)gmail.com> wrote:
...
  2012/1/15 Nikola Smolenski <smolensk(a)eunet.rs>:
  On Wednesday 11 January 2012 18:19:14 Cristian Consonni wrote:
 However, to my knowledge there is not a single OCR that exports this data, nor
 is there a standard format for it. If an open source OCR could be modified to
 do this, then it would be easy to inject data retrieved from captchas back
 into OCR-ed text. And it could be used for so much more :) 
 I know of (but am not proficient with) at least two open source
 OCR programs:
 * OCRopus[1a][1b], by the German Research Center for Artificial
 Intelligence, sponsored by Google
 * Tesseract[2a][2b], started by HP back in 1995, now Google-sponsored
 (yeah, this one too!) [note: as far as I know OCRopus used Tesseract
 as an engine for OCR]
 * GOCR/JOCR

 I think much can be done.

 Cristian 
More related tools, from the DocumentCloud project:

Raw Engine  => Tools
http://documentcloud.github.com/docsplit/

Tools => Human Documents
https://github.com/documentcloud/document-viewer

Human Documents => Beautiful viewers
http://www.pbs.org/newshour/rundown/documents/mark-twain-concerning-the-int…
http://www.commercialappeal.com/withers-exposed/pages-from-foia-reveal-with…

Using Tesseract alone is "too much work". Tesseract wants TIFF files in
a particular format and DPI. Humans want something in an easy-to-use
format: perhaps click on an image and get the text directly behind the
mouse pointer, as text that can be copied and pasted.
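
To make the "click on an image and get the text" idea concrete, here is a rough sketch in Python. It assumes the OCR step has already produced one bounding box per recognized word, in image pixel coordinates. The WordBox class, the word_at function and the sample PAGE data are all made up for illustration; they are not part of Tesseract, docsplit or document-viewer.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class WordBox:
    """One recognized word and its bounding box in image pixels.

    Hypothetical structure: real OCR output would have to be converted
    into something like this first.
    """
    text: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        # Left/top edges are inclusive, right/bottom edges are exclusive.
        return self.left <= x < self.right and self.top <= y < self.bottom


def word_at(boxes: Sequence[WordBox], x: int, y: int) -> Optional[WordBox]:
    """Return the first word whose box contains the point, or None."""
    for box in boxes:
        if box.contains(x, y):
            return box
    return None


# Made-up sample data standing in for the OCR result of one page image.
PAGE = [
    WordBox("Concerning", 40, 30, 180, 55),
    WordBox("the", 190, 30, 230, 55),
]

# Simulate a mouse click at pixel (200, 40).
hit = word_at(PAGE, 200, 40)
print(hit.text if hit else "(no word here)")
```

A linear scan is probably fine for a single page; a viewer handling many pages might want a spatial index instead, but that is beyond this sketch.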

-- 
End of message.
