5 + 3 + 3 = 0 transparency

Opinion
December 9, 2025

OpenAI’s GPT-5 family, Google’s Gemini 3 line, and, just last week, Mistral’s 3-series models: these high-profile GPAI models, and undoubtedly many others, have been released in the EU since the AI Act’s provisions for providers of General Purpose AI models entered into force on 2 August 2025.

What we have not seen, however, despite the clear obligation in Article 53(1)(d) of the AI Act, are any meaningful public summaries of the data used to train these models.

Both OpenAI (which released GPT-5 on 7 August, just days after the new rules took effect) and Google (which launched Gemini 3 in November) have published very brief descriptions of their training data in the model cards accompanying the models. This is largely consistent with what they did for previous versions. Mistral, for its part, included a section titled “Information on the data used for training, testing, and validation” in the technical documentation for the new models. In all three cases, the content does not remotely resemble the level of detail required by the template published by the AI Office.

Here are the respective passages that describe the training data for the three models, in full:

Like OpenAI’s other models, the GPT-5 models were trained on diverse datasets, including information that is publicly available on the internet, information that we partner with third parties to access, and information that our users or human trainers and researchers provide or generate. (GPT-5 System Card)

This model was trained, tested and validated on a diverse text and image dataset, encompassing multiple languages and geographies and curated from a variety of sources to ensure broad coverage and high-quality learning. This included publicly-available information from the internet, non-public datasets licensed from third-parties, as well as data generated synthetically internally. This model was also trained using user generated input and output from Mistral AI products such as Le Chat or Mistral AI Studio. (Mistral 3 Large technical documentation)

The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data. (Gemini 3 Pro Model Card)

As we have argued here before, the AI Office’s training-data template released this summer is far from perfect, but that does not mean that GPAI model providers should be able to ignore it at will.

Compared to our proposed blueprint for the template, the version published by the AI Office significantly dilutes the level of transparency needed to hold GPAI providers accountable. It accepts broad, approximate ranges instead of the precise, quantitative disclosures of dataset size, composition, and provenance, leaving considerable room for ambiguity. Yet even this reduced transparency standard is not being met by major providers releasing new GPAI models in the EU.

This reluctance to provide meaningful summaries of the training data is even more problematic because all three companies have chosen to sign the Code of Practice for GPAI model providers as a means of demonstrating compliance with their obligations under the AI Act. By signing the Code, they have explicitly acknowledged and committed themselves to the training-data-summary obligation set out in Article 53(1)(d). The current situation suggests, at minimum, a concerning lack of regard for both the Code and the underlying legal obligations. The European Commission should not let this behavior slide and should remind all model providers of their obligations under the AI Act. This seems especially important given that training-data transparency is one of the few areas where there is broad political consensus that the existing rules are too weak.

A look behind the veil — “you need all the data”

This leaves open the question of what GPAI model providers seek to achieve by refusing to publish the transparency summaries.

One plausible explanation for this behavior is that providers are effectively buying time, engaged in a game of chicken in which each waits for someone else to publish the first fully compliant report. Enforcement by the AI Office will not begin until August 2026, a full year after the transparency obligations formally took effect for new models. This creates a window in which providers can delay or minimize disclosures without facing consequences.

But in addition to buying time, another factor may be at play.

These companies might be reluctant to admit the open secret that, at the level of core pre-training, developers of generative models all rely on essentially the same underlying corpus of publicly available data, drawing on whatever material they have been able to access.

Properly completed transparency reports would not reveal any meaningful competitive secrets, trade secrets, or confidential information. But they could make this convergence impossible to ignore. By now, it is fairly clear that all major providers draw from more or less the same corpus for pre-training core models: all of the (publicly available) data.

If this is indeed the case, transparency reports will almost certainly reflect it, showing near-identical lists of “the top 10% of domain names determined by the volume of content scraped.” Any significant divergence between providers would be surprising.

We have pointed out before that the template published by the AI Office might fail to achieve the goal assigned to the training-data summaries by the AI Act, namely to “facilitate parties with legitimate interests, including copyright holders, to exercise and enforce their rights under Union law.”

But even if these transparency reports do not give individuals sufficiently meaningful insight into whether their own (personal) data was used, and therefore do not enable them to exercise their rights, they would still deepen our collective understanding of how the technology is built. If they show that all major models have been trained on essentially the same global corpus, that might prompt us to reconsider our regulatory focus. Perhaps rather than dedicating all effort and resources to fighting a rearguard action—retrofitting data-acquisition practices into the existing complexities of copyright and related legal frameworks—we should also look ahead and ensure that the value generated in this process is distributed fairly to sustain the public information ecosystem on which these models depend.

Zuzanna Warso
Paul Keller