On November 20, 2024, Open Future and Mozilla Foundation co-hosted a panel discussion on transparency requirements for AI training data. The speakers included Maximilian Gahntz from the Mozilla Foundation, Sabrina Küspert from the European AI Office, Ania Helseth from Meta, and Abbas Lightwalla from the International Federation of the Phonographic Industry (IFPI). The panel was moderated by our Director of Research, Zuzanna Warso.
The idea that “data is the lifeblood of AI” is now an accepted truth, but despite this general agreement, knowledge of what content is used to train popular general-purpose AI models remains limited. As the drafting process for the Code of Practice for General-Purpose AI (GPAI) continues (the first draft was published on November 14th), it is up to the EU’s AI Office to propose what the “sufficiently detailed summary” of the content used to train GPAI models, required under Art. 53(1)(d) of the AI Act, should look like.
Earlier this year, together with the Mozilla Foundation, we published a policy brief accompanied by a template for the “sufficiently detailed summary” of the training data.
Here is a brief summary of the main issues raised by the speakers:
- MEP Sergey Lagodinsky (Greens/EFA, DE) opened the event by reflecting on how democratic values, including transparency and trust, are central to the global confrontation between liberal democracies and authoritarian regimes. He emphasized three main concerns regarding the AI Act’s implementation: the environmental impact and resource consumption of AI systems; the challenges facing the AI Office in managing its wide-ranging responsibilities; and the need to ensure sufficient data access for researchers to analyze and guide responsible AI practices.
- Sabrina Küspert from the European AI Office outlined the Office’s role in coordinating the complex Code of Practice consultation process. She encouraged all stakeholders to provide detailed feedback to ensure the adoption of implementable rules. On the template for the sufficiently detailed summary, Sabrina emphasized that the AI Office is developing a training content template that will serve as a baseline for the Working Groups. She highlighted that the template needs to balance various stakeholders’ interests while ensuring that GPAI providers will be able to implement it.
- Maximilian Gahntz stressed that transparency in AI training data goes beyond copyright concerns, encompassing, among other things, privacy considerations and public interest research needs. These legitimate interests depend on knowing the size of the training data, its sources, and how the data is processed, including steps like anonymization and filtering.
Abbas Lightwalla and Ania Helseth provided industry views:
- Ania offered insights from the perspective of an AI model provider. She highlighted that large-scale, publicly available data is indispensable for AI development and stressed the need to protect business interests through trade secrets while still complying with transparency requirements.
- Abbas emphasized that transparency enables the exercise of existing rights and levels the playing field between actors. He rejected the view that trade secrets and commercial confidentiality should stand in the way of transparency, noting that many developers already disclose their training data, which proves that such disclosure is possible.
The Brussels event was a timely opportunity for more than 70 policymakers, industry representatives, and members of civil society to exchange views on what form meaningful transparency requirements for content used to train AI models should take. In terms of next steps, according to information shared with participants in the Code of Practice development process, the first draft of the sufficiently detailed summary will be shared by the AI Office before the next round of stakeholder consultations in January.