Via Slashdot - could be useful for future digitization efforts. Anyone
know how good it is compared to the latest ScanSoft engine?
Announcing Tesseract OCR
By Eric Case - 12:25 PM
Post by Luc Vincent, Uber Tech Lead
We wanted to let you all know that a few months ago we quietly
released - or actually re-released - an Optical Character Recognition
(OCR) engine into open source. You might wonder why Google is
interested in OCR. In a nutshell, we are all about making information
available to users, and when this information is in a paper document,
OCR is the process by which we can convert the pages of this document
into text that can then be used for indexing.
This particular OCR engine, called Tesseract, was in fact not
originally developed at Google! It was developed at Hewlett Packard
Laboratories between 1985 and 1995. In 1995 it was one of the top 3
performers at the OCR accuracy contest organized by the University of
Nevada, Las Vegas (UNLV). However, shortly thereafter, HP decided to get
out of the OCR business and Tesseract has been collecting dust in an
HP warehouse ever since. Fortunately some of our esteemed HP
colleagues realized a year or two ago that rather than sit on this
engine, it would be better for the world if they brought it back to
life by open sourcing it, with the help of the Information Science
Research Institute at UNLV. UNLV was happy to oblige, but they in turn
asked for our help in fixing a few bugs that had crept in since 1995
(ever heard of bit rot?)... We tracked down the most obvious ones and
decided a couple of months ago that Tesseract OCR was stable enough to
be re-released as open source.
A few things to know about Tesseract OCR: for now it only supports the
English language, and does not include a page layout analysis module
(yet), so it will perform poorly on multi-column material. It also
doesn't do well on grayscale and color documents, and it's not nearly
as accurate as some of the best commercial OCR packages out there.
Yet, as far as we know, despite its shortcomings, Tesseract is far
more accurate than any other open source OCR package. If you
know of one that is more accurate, please do tell us!
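
Since the engine is said to do poorly on grayscale and color input, one
possible workaround is to reduce a page to black and white before handing
it to OCR. The announcement does not describe this, so the following is only
a sketch under assumptions: it uses Otsu's method, a standard thresholding
technique, on a flat list of 8-bit grayscale values, and the helper names
otsu_threshold and binarize are hypothetical, not part of Tesseract.

```python
def otsu_threshold(pixels):
    """Pick a threshold in 0..255 that maximizes between-class variance.

    `pixels` is a flat list of 8-bit grayscale values (an assumption of
    this sketch; loading an actual image file is left out).
    Values <= threshold are treated as one class, the rest as the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of their gray values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # no pixels in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break               # no pixels left in the light class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (ink) or 255 (paper) around the threshold."""
    return [0 if p <= threshold else 255 for p in pixels]


# Tiny example: three dark "ink" pixels and three light "paper" pixels.
page = [10, 12, 11, 200, 205, 210]
bilevel = binarize(page, otsu_threshold(page))
```

Whether this actually helps on a given scan depends on lighting and paper
quality; a single global threshold is the simplest choice, not necessarily
the best one.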
We are grateful to all the people at HP who made it possible to
release Tesseract into open source, and especially John Burns, who
championed and babysat the project. We would also like to thank the
original Tesseract development team, a partial list of whom is here.
Last but not least, many thanks to our friends at UNLV's ISRI,
including Tom Nartker, Kazem Taghva, Julie Borsack and Steve Lumos,
for all their help with this project.
Peace & Love,