The European Union wants to build digital sovereignty on open source AI, yet its own copyright rules stand in the way. European research consortia, backed by substantial public funding, are developing open source, public-interest Large Language Models (LLMs). At the same time, a web of copyright restrictions and legal uncertainties constrains the open source AI ecosystem they are trying to build.
This study is a joint effort by COMMUNIA, Centrum Cyfrowe, and Open Future. It is part of Open Future’s ongoing work on public AI, which examines how Europe can develop AI systems that are open, accountable, and aligned with the public interest. It relates to our previous work on open source AI, including an in-depth look at the development of two such models.
The study is an empirical mapping of the legal landscape. It draws on eight in-depth interviews with technical leads, principal investigators, and legal and data experts from European initiatives—OpenEuroLLM, Pleias, GAMS, PLLUM, SOOFI, GPT-NL, and an anonymized repository of climate change publications. The aim is to document the copyright challenges that shape, and often hinder, open source AI development in Europe.
The Copyright in the Digital Single Market Directive (CDSMD) sets out two mandatory exceptions for text and data mining (TDM). Article 3 allows research organizations and heritage institutions to carry out TDM for scientific research. Article 4 allows anyone to carry out general-purpose TDM but includes an opt-out mechanism for rightholders.
This distinction creates friction. AI training carried out by research institutions routinely ends up outside Article 3 and falls back on the more restrictive Article 4, the opposite of what the research exception was meant to enable.
Unclear rules for training under the TDM exceptions emerge as the single biggest legal challenge for training open source LLMs. Compliance with opt-out requirements compounds the problem, made harder by the absence of standard, machine-readable information on rights reservations. Rules on data sharing are also not harmonized across the EU, so each research team works with its own crawl data, a duplication of effort that wastes public resources.
Drawing on interviewees’ observations, the study sets out four policy recommendations to strengthen the legal basis for open source model training, expand public-interest data sharing, and safeguard open science research outputs: