sorry *10 to 20* minutes per page
On Mon, Aug 24, 2020 at 12:43 PM J Hayes <slowking4(a)gmail.com> wrote:
yeah, as we know OCR is a pain point.
i have some success, using the google ocr button to get a better result
but i have also done hundreds of 2 column unzip edits, which can take
me 1 minutes per page.
we have requested an improved OCR at wishlist, which would take a
comparison of proofread page versus text layer to drive an AI improved
text layer. but no support. maybe we should propose to internet
archive?
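
A minimal sketch of that comparison, assuming the OCR text layer and the
proofread page are both available as plain strings. The function names are
hypothetical, not an existing Wikisource or Internet Archive tool. It uses
Python's standard difflib to estimate how far the text layer is from the
proofread text, and to list the differing fragments that could serve as
training pairs:

```python
import difflib


def char_error_rate(ocr_text: str, proofread_text: str) -> float:
    """Rough share of proofread characters the OCR layer failed to match."""
    if not proofread_text:
        return 0.0
    matcher = difflib.SequenceMatcher(None, ocr_text, proofread_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / len(proofread_text)


def corrections(ocr_text: str, proofread_text: str) -> list[tuple[str, str]]:
    """Return (ocr_fragment, proofread_fragment) pairs where the words differ."""
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    matcher = difflib.SequenceMatcher(None, ocr_words, good_words, autojunk=False)
    return [
        (" ".join(ocr_words[i1:i2]), " ".join(good_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

This is only an alignment heuristic: it says where the two texts disagree,
not why, and it would not by itself untangle confused columns.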
cheers
On Sat, Aug 22, 2020 at 6:12 PM Lars Aronsson <lars(a)aronsson.se> wrote:
>
> Apparently, Brewster Kahle wrote (via Federico Leva - Nemo):
> > <http://blog.archive.org/2020/08/21/can-you-help-us-make-the-19th-century-searchable/>
> >
> > Take for example, this newspaper from 1847. The images
> > <https://archive.org/details/sim_frederick-douglass-paper_1847-12-03_1_1>
> > are not that great, but a person can read them:
> >
> > The problem is our computers’ optical character recognition tech gets
> > it wrong
> > <https://archive.org/stream/sim_frederick-douglass-paper_1847-12-03_1_1/sim_frederick-douglass-paper_1847-12-03_1_1_djvu.txt>,
> > and the columns get confused.
>
> In my experience, working with ABBYY FineReader Professional,
> you always need to manually check columns / zoning.
> For just a few years of one newspaper, this might be a reasonable
> amount of manual work. But the problem is the same for centuries of
> thousands of newspapers.
>
> When I scanned encyclopedias (printed in 2 columns in 20
> volumes x 800 pages), I manually checked columns in the OCR
> program.
>
> For Wikisource, we would need a way for the OCR program to
> indicate how the zones (columns) are identified in the image,
> and let the wiki user modify these zones before sending
> each zone to the OCR program. It would be reasonable for
> the WMF to fund a developer (or team of developers) to create
> such a solution. There is already some solution for marking
> parts of a picture, right? This needs to work within pages of
> a PDF or DjVu file.
>
>
> --
> Lars Aronsson (lars(a)aronsson.se)
> Linköping, Sweden
>
> Project Runeberg - free Nordic literature - http://runeberg.org/
>
>
>
> _______________________________________________
> Wikisource-l mailing list
> Wikisource-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l