Last week, two new AI training datasets were made publicly available. Eleuther.ai published the Common Pile 0.1, and Harvard Law School Library’s Institutional Data Initiative launched Institutional Books 1.0.
A lack of sufficient high-quality data for AI training is a major problem that hinders open-source, public interest, and research AI labs. Stefaan Verhulst has been arguing that we are facing a data winter: a decline in the opening up of data for reuse, in particular for AI. And researchers from the Data Provenance Initiative have been charting a decline of what they call the AI Data Commons.
As legal challenges to training on web-scraped datasets continue to emerge, AI developers have been considering openly licensed and Public Domain works as sources from which a legally safe dataset can be built. But concerns have been raised about whether these datasets, lacking the vast web-scraped sources, will be robust enough.
The new datasets provide an answer to this question. And, more importantly, they demonstrate two different paths for sharing datasets: one based on the principles of the Public Domain and Open Data, and a commons-based one.
The first path, adopted by Eleuther.ai in building the Common Pile, is based on the principles of the Public Domain and the Open Definition. This traditional approach aims to maximise the utility and accessibility of resources by making them open. The Institutional Data Initiative, meanwhile, charts a new course: gated access governed by the principles of the data commons, intended to make data sharing more sustainable and community-based.
Common Pile includes 8TB of texts that are either in the Public Domain or released under licenses compliant with the Open Definition. The thirty diverse sources include code from the Stack v2 repository (which constitutes over 50% of the dataset), government and legal documents, wikis, web pages and online forums, academic papers, Public Domain books, and open educational resources.
The dataset was created by Eleuther.ai, a research organization working to empower open source AI development. Four years ago, Eleuther released the Pile, an 880GB dataset that combined openly licensed works and web-scraped content with Books3, the infamous collection of pirated books. The Pile was, at the time, the largest pretraining dataset, and Eleuther went on to train a family of GPT-Neo models on this data. More importantly, it became a standardized training resource for LLM developers, used by many other projects. With the release, Eleuther aimed to demonstrate the value of publicly releasing training datasets to enable research, increase transparency, and support benchmarking of model performance. Yet the inclusion of pirated in-copyright works turned out to be a liability for model developers, and the Pile was eventually taken offline under pressure from rightsholders.
The Common Pile is an effort to build a commonly shared dataset that has the same valuable characteristics for AI research and development, but without the legal liabilities. Unlike other projects—Common Corpus for example—that depend on collection-level licensing declarations, the Common Pile’s curators meticulously reviewed licensing data at the level of individual works. The main challenge was identifying “license laundering”—cases where copyrighted works are distributed with an incorrectly applied open license.
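The work-level review can be pictured with a minimal sketch. The record fields (`license`, `license_source`) and the allowlist below are illustrative assumptions, not Eleuther.ai's actual pipeline; the point is that each work is checked individually, including a guard against license laundering:

```python
# Minimal sketch of per-work license screening, assuming hypothetical
# record fields; the allowlist is an illustrative subset of SPDX license
# identifiers broadly compliant with the Open Definition.
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "MIT", "Apache-2.0"}

def screen_work(record: dict) -> tuple[bool, str]:
    """Accept a work only if its license is open AND the license claim
    comes from the rightsholder, not a third-party re-uploader."""
    license_id = record.get("license")
    if license_id not in OPEN_LICENSES:
        return False, f"non-open or unknown license: {license_id}"
    # Guard against "license laundering": require that the license was
    # applied at the original source, not by a redistributor.
    if record.get("license_source") != "rightsholder":
        return False, "license not verifiably applied by the rightsholder"
    return True, "accepted"

works = [
    {"id": "a", "license": "CC-BY-4.0", "license_source": "rightsholder"},
    {"id": "b", "license": "CC-BY-4.0", "license_source": "third-party mirror"},
    {"id": "c", "license": "All rights reserved", "license_source": "rightsholder"},
]
accepted = [w["id"] for w in works if screen_work(w)[0]]
print(accepted)  # → ['a']
```

Done at the scale of thirty sources and millions of works, this kind of check is what distinguishes work-level curation from relying on collection-level licensing declarations.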
The key concern with openly licensed datasets is whether they are robust and large enough to serve as a basis for training modern language models. If so, then it becomes possible to build AI solutions on a fully open stack that includes training data, the various code components, and the model itself. Yet this proposal has often been met with skepticism about the quality of the resulting language model. The Eleuther.ai team therefore trained a language model called Comma on the Common Pile, at sizes of 1B, 2B, and 7B parameters (the learned weights that determine the model's behavior), the last being a standard size for many of the small, open source models released recently. Tests on popular AI benchmarks show that Comma either performs on par with or outperforms similar models, including Llama 2 and Qwen3. Such benchmarks say little about Comma's actual usefulness, but they can be treated as initial proof that models trained on open data can match the quality of other models.
The Common Pile is also a response to growing discontent with training AI on publicly available web resources, a trend to which the original Pile contributed. The problems include objections to the uncompensated use of content, ethical concerns, and large-scale use of works without consent, all of which deepen the divide between LLM developers and content creators. Training on openly licensed works is seen as a solution to this challenge.
Michael Weinberg, the lawyer who leads the Engelberg Center on Innovation Law and Policy at NYU School of Law, asks: Does an AI Dataset of Openly Licensed Works Matter? He argues that even if compiling openly licensed texts sounds easy, it is in fact hard to build a dataset that allows for an “LLM free of copyright issues”. Weinberg focuses on the difficulty of complying, at scale, with attribution requirements. He goes on to suggest that an openly licensed dataset with fully cleared rights remains a moving target and might not offer the legal certainty that AI developers need, both to build their models and to regain the trust of content creators and stewards. These are valid concerns, inherent to any attempt to share content at massive scale; addressing them would require regulatory intervention establishing some form of de minimis protection from liability.
Nevertheless, the Common Pile is important as a principled effort to share a robust and useful AI training dataset as Open Data. Eleuther.ai declares that it will continue curating open datasets and adding them to the Common Pile. In this approach, the goal is to release as much content as possible under as permissive conditions as possible. It follows the same approach that has been used over the past quarter-century by successive waves of efforts to open up various types of resources.
The Institutional Data Initiative, based at Harvard University, starts with a similar premise: scarcity of publicly available, high-quality training data; and a similar solution: a large dataset of Public Domain texts. Yet the source of their Institutional Books 1.0 dataset is much more specific: it consists of scans of Public Domain books held by the Harvard Library and digitized, beginning in 2006, through the Google Books project. Where Eleuther.ai has been aggregating sources it can access on the web, Harvard has created a dataset from its own collection. The resulting dataset comprises almost one million books, amounting to 242 billion tokens.
The technical pipelines for cleaning the text data, organizing it, and confirming its legal status are broadly similar to those developed for the Common Pile. In each case, efforts are made to ensure what IDI describes as “information stewardship.” Admittedly, IDI’s work is made easier by the fact that it is building a dataset from a library collection of books. So while the Common Pile white paper outlines complicated procedures for determining the copyright status of works, IDI accomplished this automatically for over 93% of works, based on the HathiTrust rights database.
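Automated rights determination of this kind reduces, in essence, to a lookup against a rights database. The sketch below is a simplified assumption in the spirit of IDI's use of HathiTrust data: the codes shown ("pd", "pdus", "ic", "und") are real HathiTrust rights attributes, but the mapping is illustrative, and the actual rights schema is far richer:

```python
# Illustrative sketch of rule-based rights triage against HathiTrust-style
# rights codes. The mapping is a simplified assumption, not IDI's pipeline.
RIGHTS_TO_STATUS = {
    "pd":   "public-domain",          # public domain worldwide
    "pdus": "public-domain-us-only",  # public domain in the US only
    "ic":   "in-copyright",
    "und":  "needs-manual-review",    # rights undetermined
}

def triage(volume_rights_code: str) -> str:
    """Map a volume's rights code to a dataset-inclusion status;
    anything unrecognized falls back to manual review."""
    return RIGHTS_TO_STATUS.get(volume_rights_code, "needs-manual-review")

volumes = ["pd", "pdus", "ic", "und"]
print([triage(v) for v in volumes])
```

The residual few percent of volumes that cannot be resolved this way is where the manual review effort concentrates, which is what makes a library collection with an existing rights database so much cheaper to clear than web-aggregated sources.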
The data sharing framework is where IDI’s approach differs significantly from the one adopted by Eleuther.ai. IDI introduced Terms of Use for the dataset as a whole, even though the underlying book scans are in the Public Domain. These terms limit use of the dataset to noncommercial purposes and prohibit sharing and distribution of the dataset, as well as of derivatives that could substitute for the original. In addition, access to the dataset hosted on HuggingFace is gated: in order to download the data, users must share contact information with IDI and agree to the Terms of Use.
From a traditional Open Data or Open Access perspective, these are severe limitations on the freedom to use and share Public Domain content. At the same time, they fit within the boundaries of commons-based dataset governance, an approach that balances access and inclusivity with concerns for sustainability. This is well expressed in the dataset paper as “minor friction [that] can help build the relationships and norms necessary to grow a collaborative community”.
The main reason for doing so, mentioned repeatedly in the white paper, is the goal of building a community and an “organic institutional commons.” Gated access allows IDI to establish a community-led process to determine ways of using the data and managing the dataset. Controlling access is also a prerequisite for being able to map the community of reusers and for establishing relationships.
Less prominent in the papers are goals related to sustainability and protection of the commons. While this is not stated directly, gated access—with limits placed on commercial use, redistribution and derivation—is a means of making the effort sustainable. It is a clear example of addressing the Paradox of Open, and managing the risk of exploitation of the commons. Without these limitations, the dataset would immediately be treated as a resource to be exploited. With gated access, conditions of reuse by the largest commercial labs can be negotiated.
In order to understand this approach, and to see gated access as more than a limitation on Open Access, it helps to treat the dataset as public infrastructure that needs to be provisioned in a sustainable way. Traditionally, such concerns were out of scope for open sharing frameworks, which focused on reducing legal barriers to content sharing. Today, it is increasingly clear that open sharing of content needs to go hand in hand with sustainable provision of the necessary infrastructure, a challenge recently demonstrated by the Wikimedia Foundation's evidence on the costs of serving data at scale. Limitations on access create the conditions to negotiate a more sustainable approach, with some form of reciprocity required in particular from the large AI companies.
These two Public Domain datasets outline two pathways to data sharing for AI training: one based on Open Data principles, and another based on commons governance. Both start with a similar premise: that openly shared and Public Domain works can constitute a resource for AI training that is both a robust base for modern LLMs and reduces legal liability. As such, they are important examples of how such datasets can democratize AI development.
The Institutional Books dataset is, in addition, an important experiment in dataset governance. Where Eleuther adopts a well-established Open Data model with minimal governance layers limited to open licensing, IDI is experimenting with a new, gated approach. In this way, the IDI and Harvard Library position themselves as organizations capable of and willing to steward a common dataset and an ecosystem around it. The gated access mechanisms and use limitations will limit reuse, but at the same time foster a community—something that cannot be easily accomplished in the traditional Open Data model.
The difference in approach can be explained by the different roles that Eleuther.ai and the Institutional Data Initiative play in the data and AI ecosystems. Eleuther, as a reuser of publicly available resources, is well-suited to adopt an Open Data approach—as it is an aggregator and reuser, rather than a steward of a collection. For IDI, which is located in an institution that stewards a significant content collection, a commons-based approach is a better fit.
The provision of a robust, Public Domain-based dataset is an important milestone achieved by both initiatives—hopefully it will be further validated by AI development efforts using this data. In addition to that, the governance approach proposed by IDI is an important experiment in establishing new forms of data sharing that account for challenges inherent to AI development.
Hopefully, both initiatives will transparently share data on the reuse of their datasets, especially on the types of models built on their basis. Open and commons-based datasets will fulfill their mission of democratizing AI only if they are accompanied by state-of-the-art, open source models (for a full argument on the importance of such models, see the white paper on public AI).
Finally, the two examples are important for the ongoing European debate on data for AI. Both datasets have been developed by American entities, and while multilingual, they are skewed towards English. The new Data Union Strategy, currently under consultation by the European Commission, should include a key action to develop an AI Commons: an ecosystem of open and commons-based datasets for all European languages.