Towards Robust Training Data Transparency

Transparency of the data used to train AI models is a prerequisite for understanding how these models work. It is crucial for improving accountability in AI development and can strengthen people’s ability to exercise their fundamental rights. Yet, opacity in training data is often used to protect AI-developing companies from scrutiny and competition, affecting both copyright holders and anyone else trying to get a better understanding of how these models function.

As the European Union’s AI Office takes its final shape, we are sharing a policy brief arguing for a strong transparency requirement for content used to train general-purpose AI (GPAI) models. The AI Office will play a key role in implementing the AI Act, especially for GPAI. This paper and the accompanying blueprint of the transparency template that the AI Office is tasked with developing are a collaborative effort of Open Future and Mozilla Foundation, drawing on input from experts. In September 2024 we published an updated version of this blueprint that incorporates additional feedback from experts.

Background

According to Article 53(1)(d) of the AI Act, providers of GPAI models are expected to draw up and make publicly available a sufficiently detailed summary of the content used for training of the general-purpose AI model.

These summaries must present an overview of the data sources and sets involved, including private and public databases, and include narrative explanations. They should be prepared according to a template provided by the AI Office. This requirement was originally introduced into the AI Act in response to demands by organizations representing copyright holders but has since been expanded into a broader transparency obligation.

The AI Act’s preamble states that the purpose of these “sufficiently detailed summaries” is to facilitate the exercise and enforcement of rights under Union law by parties with a legitimate interest. That interest may relate to the protection of copyright, which is explicitly mentioned in the recital. However, our paper argues that the range of legitimate interests in greater transparency about the data used to develop GPAI models goes beyond copyright.

The purpose of the paper we are sharing today is twofold. It clarifies the categories of rights and legitimate interests that justify access to information about training data. In addition to copyright, these include, among others, privacy and personal data protection, scientific freedom, the prohibition of discrimination, and respect for cultural and linguistic diversity. Moreover, it provides a blueprint for the forthcoming template for the “sufficiently detailed summary,” which is intended to serve these interests while respecting the rights of all parties concerned.

Next steps

The AI Act was signed into law on 13 June 2024. It enters into force 20 days after publication in the Official Journal of the European Union, which is expected to happen sometime in July. GPAI rules will take effect within 12 months thereafter. Consequently, providers of GPAI models will be required to publish data summaries starting in mid-2025.

By requiring a detailed summary of the data used to train GPAI, the AI Act has provided the basis for a mechanism to meaningfully increase the transparency of AI development. The AI Act made it clear that the summary should protect the legitimate interests of all affected parties and needs to be meaningful and comprehensive.

The blueprint for the template outlined in this brief, developed in collaboration with experts from various sectors and disciplines, sets out what an effective summary and meaningful documentation of training data should look like. We intend it to serve as input to discussions on this issue and as a baseline for the AI Office’s implementation work in developing the template.
