Sorry, *10 to 20* minutes per page.
On Mon, Aug 24, 2020 at 12:43 PM J Hayes <slowking4(a)gmail.com> wrote:
Yeah, as we know, OCR is a pain point.
I have had some success using the Google OCR button to get a better result,
but I have also done hundreds of two-column unzip edits, which can take
me 1 minute per page.
We have requested an improved OCR at the wishlist, which would take a
comparison of the proofread page versus the text layer to drive an
AI-improved text layer, but got no support. Maybe we should propose it
to the Internet Archive.
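The comparison step in that wishlist idea could be sketched roughly as follows: align the raw OCR text layer against the human-proofread text and harvest the substitutions a proofreader made as (wrong, right) training pairs for a post-correction model. This is only an illustrative sketch, not the wishlist proposal itself; the function name is made up.

```python
import difflib

def correction_pairs(ocr_text, proofread_text):
    """Align the raw OCR text layer with the proofread text and return
    (ocr_fragment, corrected_fragment) pairs: the substitutions a
    proofreader made, usable as training data for an OCR post-corrector."""
    matcher = difflib.SequenceMatcher(None, ocr_text, proofread_text)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # a span the proofreader changed
            pairs.append((ocr_text[i1:i2], proofread_text[j1:j2]))
    return pairs

# e.g. correction_pairs("Tbe qnick brown f0x", "The quick brown fox")
# yields the character-level fixes ('b','h'), ('n','u'), ('0','o')
```

A real pipeline would run this per page across many proofread works; the point is that Wikisource's proofread corpus already contains the aligned ground truth such a model needs.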
On Sat, Aug 22, 2020 at 6:12 PM Lars Aronsson <lars(a)aronsson.se> wrote:
> Apparently, Brewster Kahle wrote (via Federico Leva - Nemo):
> > Take for example, this newspaper from 1847. The images
> > are not that great, but a person can read them:
> > The problem is our computers’ optical character recognition tech gets
> > it wrong
> > and the columns get confused.
> In my experience, working with ABBYY Finereader Professional,
> you always need to manually check columns / zoning.
> For just a few years of one newspaper, this might be a reasonable
> amount of manual work. But the problem is the same for centuries of
> thousands of newspapers.
> When I scanned encyclopedias (printed in 2 columns, in 20
> volumes x 800 pages), I manually checked columns in the OCR.
> For Wikisource, we would need a way for the OCR program to
> indicate how the zones (columns) are identified in the image,
> and let the wiki user modify these zones before sending
> each zone to the OCR program. It would be reasonable for
> the WMF to fund a developer (or team of developers) to create
> such a solution. There is already some solution for marking
> parts of a picture, right? This needs to work within pages of
> a PDF or DjVu file.
> Lars Aronsson (lars(a)aronsson.se)
> Linköping, Sweden
> Project Runeberg - free Nordic literature - http://runeberg.org/
> Wikisource-l mailing list
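One building block of the zone workflow Lars describes, once a wiki user has marked or adjusted the column boxes, is putting those zones into reading order before each crop is sent to the OCR engine. A minimal sketch, with all names hypothetical and a fixed pixel threshold standing in for real column grouping (a real tool would then crop the page image with these boxes):

```python
from dataclasses import dataclass

@dataclass
class Zone:
    # Pixel bounding box of one user-marked text region (column).
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height

def reading_order(zones, column_gap=50):
    """Sort zones column by column: boxes whose left edges lie within
    `column_gap` pixels of a column's first box join that column; columns
    run left to right, and boxes inside a column run top to bottom."""
    zones = sorted(zones, key=lambda z: z.x)
    columns = []
    for z in zones:
        if columns and z.x - columns[-1][0].x <= column_gap:
            columns[-1].append(z)
        else:
            columns.append([z])
    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda z: z.y))
    return ordered
```

The ordering step matters because it is exactly where two-column OCR goes wrong today: the engine reads straight across both columns instead of down one column at a time.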