Hi all. Great to meet you and thank you to Leila for inviting me to join the list. I’m the Executive Director of the Institutional Data Initiative (IDI, https://www.institutionaldatainitiative.org) at Harvard and I wanted to share our recent data release: Institutional Books (https://www.institutionaldatainitiative.org/institutional-books), a collection of nearly 1M public domain books, scanned at Harvard Library through the Google Books project.
IDI works with libraries and other knowledge institutions to publish their collections as data with the goal of establishing public-interest leverage in the AI ecosystem while improving collections for traditional patron usage. With each project, we look for novel ways to structure and analyze the collection and set standards along the way. With Institutional Books, we tackled language analysis, topic classification, and OCR correction, and our technical report (https://arxiv.org/abs/2506.08300) has even more. We hope to evolve the collection over time and release new formats as we go, such as EPUB and Markdown.
We’re also using this moment to experiment with a time-bounded Terms of Service that attempts to privilege open and noncommercial actors while garnering support from commercial actors as we iterate on sustainability. The goal is to eventually make the collection and all of its scans available under a more traditional open model.
Thoughts, questions, and collaboration are welcome. We also have a Slack where we’re talking about this collection and others. Our next project is to dig into a new collection of old newspapers, in collaboration with Boston Public Library, as we work toward building a global commons.
—Greg