Re: [Wikisource-l] Fwd: Can You Help us Make the 19th Century Searchable?

22 Aug 2020

Apparently, Brewster Kahle wrote (via Federico Leva - Nemo):
...

<http://blog.archive.org/2020/08/21/can-you-help-us-make-the-19th-century-searchable/>

 Take for example, this newspaper from 1847. The images
 <https://archive.org/details/sim_frederick-douglass-paper_1847-12-03_1_1>
 are not that great, but a person can read them:

 The problem is our computers’ optical character recognition tech gets
 it wrong

<https://archive.org/stream/sim_frederick-douglass-paper_1847-12-03_1_1/sim_frederick-douglass-paper_1847-12-03_1_1_djvu.txt>,
 and the columns get confused.

In my experience working with ABBYY FineReader Professional,
you always need to check the columns / zoning manually.
For just a few years of one newspaper, this might be a reasonable
amount of manual work. But the problem is the same for
thousands of newspapers spanning centuries.

When I scanned encyclopedias (printed in 2 columns in 20
volumes x 800 pages), I manually checked columns in the OCR
program.

For Wikisource, we would need a way for the OCR program to
indicate how the zones (columns) are identified in the image,
and let the wiki user modify these zones before sending
each zone to the OCR program. It would be reasonable for
the WMF to fund a developer (or team of developers) to create
such a solution. Some solution for marking parts of a picture
already exists, right? It would need to work on individual pages
of a PDF or DjVu file.
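
As a rough illustration of the workflow described above (this is not
an existing Wikisource or MediaWiki API), the sketch below models zones
as rectangles, applies a user's corrections, and sends each zone to an
OCR step in reading order. The names Zone, apply_user_edits,
reading_order and ocr_zones are hypothetical, and the recognize callback
stands in for whatever OCR engine would be used. Sorting by left edge and
then top edge is a simplifying assumption that only fits plain
side-by-side columns.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Zone:
    """A rectangular region on one page, in pixel coordinates (hypothetical model)."""
    zone_id: str
    page: int
    left: int
    top: int
    right: int
    bottom: int


def apply_user_edits(zones: List[Zone], edits: Dict[str, Dict[str, int]]) -> List[Zone]:
    """Return new zones with the user's coordinate corrections applied.

    `edits` maps a zone_id to the fields the user changed, e.g. {"col2": {"left": 510}}.
    Zones without an entry are returned unchanged.
    """
    return [replace(z, **edits.get(z.zone_id, {})) for z in zones]


def reading_order(zones: List[Zone]) -> List[Zone]:
    """Order zones by page, then left edge (column), then top edge.

    Assumption: zones in the same column share the same left edge.
    """
    return sorted(zones, key=lambda z: (z.page, z.left, z.top))


def ocr_zones(zones: List[Zone], recognize: Callable[[Zone], str]) -> str:
    """Run the supplied OCR callback on each zone and join the text in reading order."""
    return "\n\n".join(recognize(z) for z in reading_order(zones))


if __name__ == "__main__":
    # Zones as an OCR program might propose them for a two-column page.
    proposed = [
        Zone("col2", page=1, left=520, top=40, right=1000, bottom=1400),
        Zone("col1", page=1, left=20, top=40, right=500, bottom=1400),
    ]
    # The user moves the left border of the second column.
    corrected = apply_user_edits(proposed, {"col2": {"left": 510}})
    # Placeholder OCR step: a real one would crop the page image to the zone.
    print(ocr_zones(corrected, lambda z: f"[text of {z.zone_id}]"))
```

Keeping the zones as plain data, separate from the OCR call, is what
would let a wiki interface show them, accept corrections, and only
then run recognition zone by zone.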

-- 
   Lars Aronsson (lars(a)aronsson.se)
   Linköping, Sweden

   Project Runeberg - free Nordic literature - http://runeberg.org/
