Copyright challenges in open-source AI development in the European Union

The European Union wants to build digital sovereignty on open source AI, yet its own copyright rules stand in the way. European research consortia, backed by substantial public funding, are developing open source, public-interest Large Language Models (LLMs). At the same time, a web of copyright restrictions and legal uncertainties constrains the open source AI ecosystem they are trying to build.

This study is a joint effort by COMMUNIA, Centrum Cyfrowe, and Open Future. It is part of Open Future’s ongoing work on public AI, which examines how Europe can develop AI systems that are open, accountable, and aligned with the public interest. It relates to our previous work on open source AI, including an in-depth look at the development of two such models.

The study is an empirical mapping of the legal landscape. It draws on eight in-depth interviews with technical leads, principal investigators, and legal and data experts from European initiatives—OpenEuroLLM, Pleias, GAMS, PLLUM, SOOFI, GPT-NL, and an anonymized repository of climate change publications. The aim is to document the copyright challenges that shape, and often hinder, open source AI development in Europe.

A mismatch at the heart of the Directive

The Copyright in the Digital Single Market Directive (CDSMD) sets out two mandatory exceptions for text and data mining (TDM). Article 3 allows research organizations and heritage institutions to carry out TDM for scientific research. Article 4 allows anyone to carry out general-purpose TDM but includes an opt-out mechanism for rightholders.

This distinction creates friction. AI training carried out by research institutions routinely moves away from Article 3 and falls back on the more restrictive Article 4—the opposite of what the research exception was meant to enable.

What the interviews reveal

Unclear rules for training under the TDM exceptions emerge as the single biggest legal challenge for training open source LLMs. Compliance with opt-out requirements compounds the problem, made harder by the absence of standard, machine-readable information on rights reservations. Data sharing is also not harmonized across the EU, so researchers each work with their own crawl data—a duplication of effort that wastes public resources.

Recommendations

Drawing on interviewees’ observations, the study sets out four policy recommendations to strengthen the legal basis for open source model training, expand public-interest data sharing, and safeguard open science research outputs:

Clarify that AI training is protected TDM. EU law should explicitly state that training and developing AI systems are legitimate TDM activities under Articles 3 and 4 of the Directive.
Introduce a statutory right to share data for research. An explicit, unwaivable public-good exception should ensure that scientific research institutions can legally host, share, and republish curated training data sets for peer review, evaluation, and algorithmic validation.
Protect good-faith researchers and intermediaries. Researchers and public-interest intermediaries who follow standard compliance procedures should be shielded from statutory copyright claims and legal liability that would otherwise hinder their work on open source AI.
Build a European public training corpus. Europe needs a public training corpus as digital infrastructure, giving practical effect to the data-sharing provisions of Articles 3 and 4.
