Fwd: Can You Help us Make the 19th Century Searchable? - Wikisource-l

22 Aug 2020

Classic Wikisource issues

-------- Forwarded Message --------
Subject: 	Can You Help us Make the 19th Century Searchable?
Date: 	Fri, 21 Aug 2020 20:32:17 +0000
From: 	Brewster Kahle

<http://blog.archive.org/2020/08/21/can-you-help-us-make-the-19th-century-searchable/>

Can You Help us Make the 19th Century Searchable?

In 1847, Frederick Douglass started a newspaper
<https://archive.org/details/pub_frederick-douglass-paper> advocating
the abolition of slavery that ran until 1851.  After the Civil War,
there was a newspaper for freed slaves, the Freedmen’s Record
<https://archive.org/details/pub_freedmens-record>.  The Internet
Archive is bringing these and many more works online for free public
access. But there’s a problem: 

Our Optical Character Recognition (OCR), while the best commercially
available OCR technology, is not very good at identifying text from
older documents.  

Take, for example, this newspaper from 1847. The images
<https://archive.org/details/sim_frederick-douglass-paper_1847-12-03_1_1>
are not that great, but a person can read them.

The problem is that our computers’ optical character recognition tech gets
it wrong
<https://archive.org/stream/sim_frederick-douglass-paper_1847-12-03_1_1/sim_frederick-douglass-paper_1847-12-03_1_1_djvu.txt>,
and the columns get confused.

What we need is “Culture Tech” (a riff on fintech, or biotech) and
Culture Techies to work on important and useful projects: the things we
need, but that are unlikely to attract gushers of private equity
funding. There are thousands of professionals taking on similar
challenges in the field of digital humanities, and we want to complement
their work with industrial-scale tech that we can apply to cultural
heritage materials.

One such project would be to work on technologies to make 19th-century
documents fully digital. We need to improve OCR to enable full-text
search, but we also need help segmenting documents into columns and
articles. The Internet Archive has lots of test materials, and thousands
of people are uploading more documents all the time.
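
To make the column-segmentation part concrete, here is a minimal sketch of
one textbook approach, a vertical projection profile: count the ink in each
pixel column of a binarized page and treat wide blank runs as gutters. This
is only an illustration, not the Internet Archive's pipeline; the function
name find_columns, the min_gap parameter and the toy page are all
assumptions made up for this example, and real scans would need deskewing
and noise tolerance first.

```python
def find_columns(page, min_gap=2):
    """Return (start, end) x-ranges (end exclusive) of text columns.

    page: list of rows, each a list of 0 (blank) / 1 (ink) pixels.
    min_gap: narrowest run of blank pixel columns treated as a gutter
             (an assumed tuning value, not taken from any real system).
    """
    if not page:
        return []
    width = len(page[0])
    # Vertical projection profile: amount of ink in each pixel column.
    profile = [sum(row[x] for row in page) for x in range(width)]

    columns = []
    start = None    # x where the current text column began
    blank_run = 0   # consecutive blank pixel columns inside a text column
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                # The gap is wide enough to be a gutter: close the column.
                columns.append((start, x - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        # Close the last column, dropping any trailing blank pixels.
        columns.append((start, width - blank_run))
    return columns


# Toy "page": two text columns separated by a three-pixel gutter.
page = [
    [1, 1, 1, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 1, 0, 1],
]
print(find_columns(page))  # [(0, 3), (6, 10)]
```

The single blank pixel at x = 8 is narrower than min_gap, so it is kept
inside the second column rather than splitting it. Choosing that threshold
is exactly the kind of decision that gets hard on noisy 19th-century scans.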

What we do not have is a good way to integrate work on these projects
with the Internet Archive’s processing flow.  So we need help and ideas
there as well.

Maybe we can host an “Archive Summer of CultureTech” or something… Just
ideas. Maybe we could work with a university department that would want to
build programs and classes around Culture Tech… If you have ideas or
skills to contribute, please post a comment here or send an email to
info(a)archive.org with some of this information.

The post Can You Help us Make the 19th Century Searchable?
<http://blog.archive.org/2020/08/21/can-you-help-us-make-the-19th-century-searchable/>
appeared first on Internet Archive Blogs <http://blog.archive.org>.