[Foundation-l] Google Books

John Vandenberg jayvdb at gmail.com
Sun Jun 21 23:19:34 UTC 2009

<subject line changed>

On Mon, Jun 22, 2009 at 12:55 AM, Anthony <wikimail at inbox.org> wrote:
> On Sun, Jun 21, 2009 at 10:23 AM, Anthony <wikimail at inbox.org> wrote:
> > On Sun, Jun 21, 2009 at 8:35 AM, John Vandenberg <jayvdb at gmail.com> wrote:
> >
> >> I suggest you take a look at a few of the DJVU files provided by
> >> Internet Archive.  Then you can point out real faults that you see.
> >
> >
> > I will.  My apologies for misunderstanding your email.
> >
> Okay, http://www.archive.org/details/catholicencyclo16herbgoog happened to
> be the first book I randomly picked from Google Book Search.  There's no
> text version.

Lucky you.  Most of the other CE1913 volumes on Internet Archive have
a DJVU file.


> And the text version I find of other editions seems to be much much worse
> than the google OCR results.

The OCR engines, especially tesseract, which Google uses, have only
recently started to handle multiple columns well, so old OCR output
is of lesser quality.  If an old DJVU has been copied over to
Internet Archive, Google Books may have since reprocessed that book,
so better OCR may be available from Google Books.  Internet Archive
also reprocesses its DJVU files, and Wikisource has its own "OCR"
button which allows per-page reprocessing to be done by an OCR bot in
the background.

However, CE1913 is not a good example as it would be a bit silly to
use OCR from _anywhere_: there are multiple complete proof-read
editions on the web, including on Wikisource ;-)


Also note that Google Books shows most of the CE1913 volumes as "No
preview available" to me, probably because I am in Australia; only
one or two are "Snippet view".


John Vandenberg
