Hi Greg (CCing the "Wikimedia & GLAM collaboration" mailing list),
First, as there has been no reaction here yet: Congrats to you and Harvard Law School Library on this release! A dataset of one million high-quality-OCR public domain books sounds very impressive.
However, your message here, and in particular its highlighting of *"time-bounded Terms of Service that attempts to privilege open and noncommercial actors"*, give the distinct impression that you are unaware of some central aspects of Wikipedia and the Wikimedia movement, or indeed the wider free-culture movement as well. While the Wikimedia Foundation is indeed a nonprofit organization, and Wikipedia and the other Wikimedia projects are indeed noncommercial, they have never accepted content licenses or terms that are confined to "open and noncommercial actors". So let me link some explanatory material:
The Wikimedia Foundation's licensing policy https://foundation.wikimedia.org/wiki/Resolution:Licensing_policy (which governs the content on Wikipedia and all other Wikimedia projects) relies on *a definition of "free content" that excludes licenses limited to noncommercial usage*, like your terms are. Summarizing the rationales for this long-standing decision would go too far here - if you are interested in those, this https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/thread/JUNXXJPIZRMCFAPNJEGXPPENCOS6DOQW/ might be a good starting point. But to highlight one well-known problem with such licenses (in particular the -NC variants of the Creative Commons licenses), because it may help to illustrate some especially problematic restrictions that you/your lawyers attempt to impose: People have found out time and again that it is difficult to actually define commercial usage in a way that doesn't have unintended consequences. (E.g. could a hobbyist blogger be sued for using an NC-licensed image because her blog features some Google ads?) Creative Commons even ran a whole study in an attempt to retroactively clarify such boundaries.
But in any case, despite these well-documented complications, legal restrictions about the commercial *usage* of particular material still seem more straightforward to figure out than the *restrictions on "intent" and "affiliation" of the *user** that you (or Harvard's lawyers?) try to impose in the terms of use for this release https://huggingface.co/datasets/institutional/institutional-books-1.0:
"Open-source projects and other public-use efforts are welcome, even if they may indirectly support commercial use, so long as they are unaffiliated with commercial actors or intent."
Your requirement that an open source project must not even be "affiliated" with "commercial ... intent" would likely exclude, say, the majority of widely used (e.g. by Wikimedia organizations https://meta.wikimedia.org/wiki/FLOSS-Exchange) open source software projects, which are frequently maintained either by a commercial company or by volunteers who also have a related day job as a developer or may offer paid support. Even the most anticapitalist purists in the free software movement shy away from such restrictions in their licenses. In any case, we can be pretty sure that your clause rules out the Wikimedia Foundation, as it is not just "affiliated" with a commercial actor but has one directly incorporated as a subsidiary, namely the for-profit Wikimedia LLC. You don't seem to be aware of this, given that you came here with the apparent impression that an offer to "privilege open and noncommercial actors" may enable a cooperation.
The second clause of your terms https://x.com/tilmanbayer/status/1933311788688552165 (*"No Redistribution"*) is likewise a non-starter for "open actors" - it is almost the definition of non-open.
I do realize of course that there will be many AI/ML folks on HF and elsewhere who are happy to use such a dataset while blissfully ignoring such attempts to impose restrictions on public domain content, perhaps assuming - possibly correctly - that you didn't think these terms of use through very thoroughly and are thus unlikely to enforce them, or who are simply not yet as familiar with the long-term effects of such legal footguns as Wikimedians and FLOSS developers have become over many years. That said, I've seen your terms cause consternation in the open AI/ML world too, e.g. on the EleutherAI Discord.
You should also be aware that in the history of the Wikimedia movement there have been some ugly *legal disputes with GLAMs* (galleries, libraries, archives and museums, i.e. organizations like yours) who *attempted to restrict reproduction of public domain works* in their possession with similar rationales (i.e. an alleged need to extract revenue to refinance digitization efforts or such, which I hear echoing in your vague remarks about "sustainability", "ecosystem", etc.). Two examples:
- https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_Wikimedia_Founda...
- https://en.wikipedia.org/wiki/Reiss_Engelhorn_Museum#Wikimedia_lawsuit (While that museum prevailed in court against the Wikimedia Foundation, the EU Copyright Directive subsequently made such assertions of copyright over faithful reproductions of public domain works impossible.)
I'm not saying that the Institutional Books project is likely to become similarly contentious (if only for the simple reason that Wikimedians have long been importing the same underlying Google Books scans https://commons.wikimedia.org/wiki/Category:Scans_from_Google_Books, often to do their own OCR and proofreading on Wikisource). I'm just trying to help you understand that the restrictions on public access that you attempt to impose here under the label of "public-interest leverage" - i.e. your own institution retaining control over the content so you can monetize it - are likely to be seen as unacceptable by the open content movement.
Another point you should be aware of is that while Wikimedia volunteers spend a lot of time diligently enforcing the copyrights of third parties (by deleting infringing material uploaded to Wikimedia projects), they *explicitly reject https://commons.wikimedia.org/wiki/Commons:Non-copyright_restrictions enforcing non-copyright terms* imposed by such third parties.
Lastly, a question: You say here https://www.institutionaldatainitiative.org/posts/open-call-for-collaborators that you (the Institutional Data Initiative) are "one of the Harvard-affiliated beneficiaries of OpenAI's new NextGenAI consortium". *Is OpenAI also one of your customers* paying for privileged access to the Institutional Books dataset (while your terms exclude the general public from it for the time being)? I'm not arguing that OpenAI is evil per se, or that academic institutions and GLAMs must never collaborate with Big Tech companies. (After all, Google Books, which your project is based on, was such a collaboration between Big Tech and academic libraries in the first place. And many Wikipedians can testify to its great value and usefulness for the general public.) However, the obfuscatory language in your post here regarding commercial partnerships and monetization ("garnering support from commercial actors as we iterate on sustainability"), combined with vague gesturing at a possible time-delayed free release at an undetermined point in the future, doesn't exactly inspire trust in this matter. If the project provides more transparent information about this question elsewhere, feel free to provide pointers. It would also be interesting to learn how much revenue the Institutional Data Initiative projects to derive from this monetization of public domain works.
Regards, Tilman ([[User:HaeB]])
On Mon, Jul 7, 2025 at 7:32 AM Leppert, Greg gleppert@law.harvard.edu wrote:
Hi all. Great to meet you and thank you to Leila for inviting me to join the list. I’m the Executive Director of the Institutional Data Initiative <https://www.institutionaldatainitiative.org> (IDI) at Harvard and I wanted to share our recent data release—Institutional Books <https://www.institutionaldatainitiative.org/institutional-books>, a collection of nearly 1M public domain books, scanned at Harvard Library through the Google Books project.
IDI works with libraries and other knowledge institutions to publish their collections as data with the goal of establishing public-interest leverage in the AI ecosystem while improving collections for traditional patron usage. With each project, we look for novel ways to structure and analyze the collection and set standards along the way. With Institutional Books, we tackled language analysis, topic classification, and OCR correction, and our technical report <https://arxiv.org/abs/2506.08300> has even more. We hope to evolve the collection over time and release new formats as we go, such as EPUB and Markdown.
We’re also using this moment to experiment with a time-bounded Terms of Service that attempts to privilege open and noncommercial actors while garnering support from commercial actors as we iterate on sustainability. The goal is to eventually make the collection and all of its scans available under a more traditional open model.
Thoughts, questions, and collaboration welcomed. We also have a Slack where we’re talking about this collection and others. Our next project is to dig in on a new collection of old newspapers, in collaboration with Boston Public Library, as we work toward building a global commons.
—Greg