Classic Wikisource issues
-------- Forwarded message -------- Subject: Can You Help us Make the 19th Century Searchable? Date: Fri, 21 Aug 2020 20:32:17 +0000 From: Brewster Kahle
http://blog.archive.org/2020/08/21/can-you-help-us-make-the-19th-century-searchable/
Can You Help us Make the 19th Century Searchable?
In 1847, Frederick Douglass started a newspaper https://archive.org/details/pub_frederick-douglass-paper advocating the abolition of slavery that ran until 1851. After the Civil War, there was a newspaper for freed slaves, the Freedmen’s Record https://archive.org/details/pub_freedmens-record. The Internet Archive is bringing these and many more works online for free public access. But there’s a problem:
Our Optical Character Recognition (OCR), while the best commercially available, is not very good at identifying text in older documents.
Take for example, this newspaper from 1847. The images https://archive.org/details/sim_frederick-douglass-paper_1847-12-03_1_1 are not that great, but a person can read them:
The problem is our computers’ optical character recognition tech gets it wrong https://archive.org/stream/sim_frederick-douglass-paper_1847-12-03_1_1/sim_frederick-douglass-paper_1847-12-03_1_1_djvu.txt, and the columns get confused.
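The column confusion is largely a layout-analysis problem: before any character recognition happens, the page has to be split into columns. A minimal sketch of one common approach, a vertical projection profile, in plain Python — the toy binarized page and the `min_gap` gutter width are illustrative assumptions, not anything the Internet Archive's pipeline actually uses:

```python
def find_columns(page, min_gap=20):
    """Detect text columns in a binarized page image.

    page: list of rows; each row is a sequence of 0/1 pixels (1 = ink).
    Returns (start, end) x-ranges, end-exclusive, one per detected column.
    A run of at least `min_gap` blank pixel columns is treated as a gutter.
    """
    width = len(page[0])
    # Vertical projection profile: total ink in each pixel column.
    # Text columns show up as high-ink runs, gutters as near-zero runs.
    profile = [sum(row[x] for row in page) for x in range(width)]
    columns, in_col, start, gap = [], False, 0, 0
    for x, ink in enumerate(profile):
        if ink:
            if not in_col:
                in_col, start = True, x
            gap = 0
        elif in_col:
            gap += 1
            if gap >= min_gap:           # gutter found: close the column
                columns.append((start, x - gap + 1))
                in_col = False
    if in_col:                           # page ends inside a column
        columns.append((start, width - gap))
    return columns

# Toy page, 300 px wide: two 110-px text columns with a 60-px gutter.
page = [[0] * 10 + [1] * 110 + [0] * 60 + [1] * 110 + [0] * 10
        for _ in range(100)]
print(find_columns(page))  # → [(10, 120), (180, 290)]
```

Real scans need binarization and deskewing first, and ornate 19th-century layouts (spanning headlines, rules between columns) are exactly where a fixed profile threshold breaks down — which is why manual zone review keeps coming up in this thread.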
What we need is “Culture Tech” (a riff on fintech, or biotech) and Culture Techies to work on important and useful projects–the things we need, but are probably not going to get gushers of private equity interest to fund. There are thousands of professionals taking on similar challenges in the field of digital humanities and we want to complement their work with industrial-scale tech that we can apply to cultural heritage materials.
One such project would be to work on technologies to bring 19th-century documents fully digital. We need to improve OCR to enable full-text search, but we also need help segmenting documents into columns and articles. The Internet Archive has lots of test materials, and thousands of people are uploading more documents all the time.
What we do not have is a good way to integrate work on these projects with the Internet Archive’s processing flow. So we need help and ideas there as well.
Maybe we can host an “Archive Summer of CultureTech” or something… just ideas. Maybe working with a university department that would want to build programs and classes around Culture Tech… If you have ideas or skills to contribute, please post a comment here or send an email to info@archive.org with some of this information.
The post Can You Help us Make the 19th Century Searchable? http://blog.archive.org/2020/08/21/can-you-help-us-make-the-19th-century-searchable/ appeared first on Internet Archive Blogs http://blog.archive.org.
Apparently, Brewster Kahle wrote (via Federico Leva - Nemo):
http://blog.archive.org/2020/08/21/can-you-help-us-make-the-19th-century-searchable/
In my experience working with ABBYY FineReader Professional, you always need to manually check the columns/zoning. For just a few years of one newspaper, this might be a reasonable amount of manual work. But the problem is the same for centuries of thousands of newspapers.
When I scanned encyclopedias (printed in two columns, 20 volumes × 800 pages each), I manually checked the columns in the OCR program.
For Wikisource, we would need a way for the OCR program to indicate how the zones (columns) were identified in the image, and to let the wiki user modify these zones before each zone is sent to the OCR program. It would be reasonable for the WMF to fund a developer (or a team of developers) to create such a solution. There are already tools for marking parts of a picture, right? This would need to work within the pages of a PDF or DjVu file.
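As a rough illustration of what such a zoning hand-off might look like — the field names and schema below are invented for the example, since the thread only sketches the idea — the OCR backend could emit its detected zones as JSON, a wiki widget could let the user adjust the rectangles, and each zone would then be cropped and OCRed separately:

```python
import json

def crop_zones(page_image, zones):
    """Cut a page image (list of pixel rows) into one crop per zone,
    sorted by reading order; each crop would then be OCRed on its own."""
    crops = []
    for z in sorted(zones, key=lambda z: z["order"]):
        crops.append([row[z["x"]:z["x"] + z["w"]]
                      for row in page_image[z["y"]:z["y"] + z["h"]]])
    return crops

# Hypothetical zone schema: one rectangle per column, tagged with the page
# number and a reading-order index. Tiny toy page: left column is 1s,
# right column is 2s, separated by a one-pixel gutter of 0s.
zones = [
    {"page": 1, "order": 0, "x": 0, "y": 0, "w": 4, "h": 3},
    {"page": 1, "order": 1, "x": 5, "y": 0, "w": 4, "h": 3},
]
page = [[1, 1, 1, 1, 0, 2, 2, 2, 2] for _ in range(3)]

left, right = crop_zones(page, zones)
print(len(left), len(left[0]))  # → 3 4  (3 rows, 4 pixels wide)

# The zone list can round-trip as JSON between the OCR backend and a
# zone-editing widget in the wiki interface:
payload = json.dumps(zones)
```

The hard part Lars points at is not the cropping but the round trip: the editor needs to see the proposed zones drawn over the scanned page inside a PDF or DjVu viewer, adjust them, and resubmit — the JSON payload is just the wire format such a tool might use.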
Yeah, as we know, OCR is a pain point. I have had some success using the Google OCR button to get a better result, but I have also done hundreds of two-column unzip edits, which can take me 1 minute per page.
We have requested an improved OCR tool at the wishlist, which would compare the proofread page against the text layer to drive an AI-improved text layer, but it got no support. Maybe we should propose it to the Internet Archive?
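The proofread-versus-text-layer comparison could be bootstrapped with a simple word-level alignment. This sketch uses Python's standard `difflib` to pull out (OCR word, corrected word) pairs from a proofread page — only the data-collection step for such a system, not the correction model itself, and the sample strings are invented:

```python
import difflib

def correction_pairs(ocr_text, proofread_text):
    """Align an OCR text layer with its human-proofread version and
    return (ocr_word, corrected_word) pairs — raw training material
    for an OCR post-correction model."""
    a, b = ocr_text.split(), proofread_text.split()
    pairs = []
    # get_opcodes() yields equal/replace/insert/delete spans; only the
    # 'replace' spans are words the proofreader actually changed.
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag == "replace":
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs

print(correction_pairs("tbe quick brovvn fox", "the quick brown fox"))
# → [('tbe', 'the'), ('brovvn', 'brown')]
```

Wikisource's proofread pages are exactly this kind of parallel corpus: millions of OCR/corrected page pairs already exist, which is what makes the wishlist proposal plausible.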
cheers
On Sat, Aug 22, 2020 at 6:12 PM Lars Aronsson lars@aronsson.se wrote:
-- Lars Aronsson (lars@aronsson.se) Linköping, Sweden
Project Runeberg - free Nordic literature - http://runeberg.org/
Wikisource-l mailing list Wikisource-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikisource-l
Sorry, *10 to 20* minutes per page.
On Mon, Aug 24, 2020 at 12:43 PM J Hayes slowking4@gmail.com wrote: