Outline for a European Books Data Commons

This new concept paper presents an outline for establishing a European Books Data Commons (EBDC)—a piece of public digital infrastructure designed to provide centralized access to large, high-quality datasets of digitized books from European libraries. It is conceived as a commons-based infrastructure governed collectively by the contributing libraries.

Authored by Paul Keller, this paper builds on a series of structured conversations about the idea of a European Books Data Commons that we convened together with Europeana during the first half of 2025. It addresses a critical gap in how Europe manages its digitized cultural heritage in the age of AI. It also builds on earlier work presented in Towards a Books Data Commons for AI Training, which explored the broader concept of shared infrastructure that makes book collections available for AI model development while ensuring that libraries and cultural heritage institutions keep control over their digitized materials and can fulfill their public service missions.

The EBDC proposal responds to the challenge that many European libraries face: their digitized collections of public domain books remain largely inaccessible for AI training and other innovative uses. By creating shared public infrastructure under library control, the EBDC would enable these institutions to optimize their collections for diverse uses—from individual access to bulk data provision for AI model development—while maintaining clear provenance and data quality.

Key Features of the Proposed EBDC

The paper outlines several core elements:

Public infrastructure under library control—allowing libraries to store and manage their digitized collections on infrastructure they govern, reducing dependence on commercial providers.
Centralized access mechanism—providing researchers and AI developers with a single point of access to European digitized public domain books, reducing redundant scraping and ensuring data authenticity.
Language diversity—bringing together collections from multiple library partners increases the availability of high-quality, linguistically diverse datasets for AI training.
Technical architecture—distributed storage combined with unified APIs, data-processing pipelines compatible with existing tools like those from the Institutional Data Initiative, and European data sovereignty considerations.
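
To make the "distributed storage combined with unified APIs" element more concrete, the following is a minimal illustrative sketch, not a description of the architecture in the paper. All names here (BookRecord, LibraryStore, UnifiedCatalog, build_demo_catalog) and the placeholder libraries and titles are assumptions introduced for illustration. The sketch only shows the general idea: each library keeps its own store, a single access point queries across all of them, and every record carries its provenance.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class BookRecord:
    """Hypothetical metadata for one digitized public domain book."""
    identifier: str
    title: str
    language: str        # language code, e.g. "nl" or "pl"
    source_library: str  # provenance: which library contributed the book


@dataclass
class LibraryStore:
    """Hypothetical storage node managed by one contributing library."""
    name: str
    records: List[BookRecord] = field(default_factory=list)

    def add(self, identifier: str, title: str, language: str) -> None:
        # Provenance is stamped at ingestion so it cannot be lost later.
        self.records.append(BookRecord(identifier, title, language, self.name))


class UnifiedCatalog:
    """Hypothetical single point of access across all library stores."""

    def __init__(self, stores: Iterable[LibraryStore]) -> None:
        self._stores = list(stores)

    def search(self, language: Optional[str] = None) -> Iterator[BookRecord]:
        # Data stays in each library's store; only the query is unified.
        for store in self._stores:
            for record in store.records:
                if language is None or record.language == language:
                    yield record

    def language_counts(self) -> Dict[str, int]:
        # Rough view of linguistic diversity across all contributors.
        counts: Dict[str, int] = {}
        for record in self.search():
            counts[record.language] = counts.get(record.language, 0) + 1
        return counts


def build_demo_catalog() -> UnifiedCatalog:
    """Build a catalog from two placeholder libraries (invented sample data)."""
    library_a = LibraryStore("Library A")
    library_a.add("a-001", "Placeholder title one", "nl")
    library_a.add("a-002", "Placeholder title two", "fr")
    library_b = LibraryStore("Library B")
    library_b.add("b-001", "Placeholder title three", "pl")
    library_b.add("b-002", "Placeholder title four", "nl")
    return UnifiedCatalog([library_a, library_b])
```

In this sketch, calling build_demo_catalog().search(language="nl") yields records from both placeholder libraries, each still labeled with the library it came from. A real service would add authentication, bulk export, and governance rules, none of which are modeled here.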

The EBDC is positioned as an independent service within the common European data space for cultural heritage, complementing existing infrastructure like Europeana while addressing the specific need for full digital artifacts optimized for AI training and other computational uses.

Alignment with European Policy Objectives

The proposal aligns with multiple EU policy priorities: it strengthens digital sovereignty by creating European-controlled infrastructure for cultural heritage data; it supports the AI Continent Action Plan by providing high-quality training datasets; and it advances core objectives of the recently released Data Union Strategy.

Critically, the paper also addresses sustainability challenges. With estimated annual operating costs between €500,000 and €750,000, the EBDC will require mechanisms to secure long-term funding. One option is contributions from commercial AI developers who benefit from access to these datasets, in line with the principles of conditional openness and commons-based governance.

Realizing the EBDC would be an important step toward ensuring European libraries can fully participate in the AI era, maintaining control over their collections while fulfilling their public service missions. The paper calls for coordinated action among libraries, policymakers, and technology providers to turn this vision into reality.
