Towards a Books Data Commons for AI Training

April 8, 2024

This white paper describes ways of building a books data commons: a responsibly designed, broadly accessible data set of digitized books to be used in training AI models. This report, written in partnership with Creative Commons and Proteus Strategies, is based on a series of workshops that brought together practitioners building AI models, legal and policy scholars, and experts working with collections of digitized books.

Most large language models are trained mainly on text obtained through web scraping. Still, books have also played an important role in their development. Because of their length, editorial quality, and breadth of subject matter, books are an important type of training data that can improve the quality of a model. Many AI developers have relied on a dataset of books called “Books3”, which was used to train some of the most popular models available today. This dataset turned out to be sourced from a site not authorized to share the content and therefore became a major liability for AI developers.

The aim of this report is to explore other ways in which a collection of books can be made available for AI training. We use the term “commons” to describe a resource that is broadly shared and accessible, and that therefore removes the need for each individual actor to acquire, digitize, and format their own corpus of books for AI training. A shared books data commons would reduce inequalities in access to training data, since major commercial AI developers today have the advantage of access to proprietary data. A commons-based approach also means collective and intentional management of the collection as a shared resource.

In the paper, we first explain why books matter for AI training and how broader access could be beneficial. We then summarize two tracks that might be considered for developing such a resource, highlighting existing projects that help foreground the potential challenges. One track relies on public domain and permissively licensed books, while the other depends on exceptions to copyright to enable training on in-copyright books. The report also presents several key design choices and next steps that could advance further development of this approach. These are based on the principles for commons-based data governance for AI, which we published in March 2024.

 


Alek Tarkowski
Paul Keller
with: Derek Slater (Proteus Strategies), Betsy Masiello (Proteus Strategies)