Access to AI training datasets and transparency about their characteristics are key mechanisms for ensuring responsible AI development. Information on the provenance, creation, and use of training datasets enables greater accountability and helps mitigate negative impacts such as AI bias. Access to these datasets also improves the reproducibility of research and levels the playing field in terms of market competition.
At least since 2018, when the “Datasheets for datasets” paper was published, there has seemingly been a consensus that dataset transparency is a key principle of responsible AI. Yet reality looks very different. None of the major commercial models released over the last year, including GPT-4, Llama 2, and Gemini, disclose information about their training datasets or make them available. Mistral Large, a model from the French developer that recently backed out of a commitment to open-sourcing AI, was likewise released without any information about the data it was trained on.
Against this backdrop of a clear market failure, two paths to responsible dataset governance can be charted. One relies on community norms of open source development and open science, which are expressed in the development of open AI systems. The second assumes that transparency is mandated through legislation.
The European AI Act, now nearly finalized, includes such transparency requirements for so-called general-purpose AI systems. It also includes exemptions from these rules for models shared under free and open source licenses.
The AI Act is important for open AI development not just because of the exemptions – more importantly, it sets a precedent in defining open AI development itself.
At Open Future, we have been paying attention to the issue of AI training datasets: to the role they play in AI development and to specific dataset governance mechanisms that need to be part of an open approach to AI.
The emergence of the BLOOM and Stable Diffusion open AI models in mid-2022 launched a debate on what open AI, or open source AI, means and what the norms of open AI development are. Two structured processes aim to provide these definitions: the Open Source Initiative is working towards a definition of open-source AI systems, and the Digital Public Goods Alliance has set up a community of practice to understand how its standard for digital public goods can be applied to AI systems.
As these debates continue, the European legislators, through the AI Act, have been the first to settle on a definition. One that will be mandated in the European Union but also resonate — in a typical “Brussels effect” fashion — beyond its borders.
In recital 60i, the AI Act defines open source AI systems as follows:
Software and data, including models, released under a free and open-source licence that allows them to be openly shared and where users can freely access, use, modify and redistribute them or modified versions thereof, can contribute to research and innovation in the market and can provide significant growth opportunities for the Union economy. General purpose AI models released under free and open-source licences should be considered to ensure high levels of transparency and openness if their parameters, including the weights, the information on the model architecture, and the information on model usage are made publicly available. The licence should be considered free and open-source also when it allows users to run, copy, distribute, study, change and improve software and data, including models under the condition that the original provider of the model is credited, the identical or comparable terms of distribution are respected.
This recital covers two important issues. First, it provides a definition of a “free and open-source license.” (Kate Downing wrote a detailed critique of this definition, showing how — in particular — it fails to account for the emergence of responsible AI licensing.) Second, it states that if the weights, model architecture, and information on model usage of an AI model are made publicly available under a free and open-source license, then the model should be considered to ensure a high level of openness and transparency.
A similar definition of open source general purpose AI is provided in Article 52c:
general purpose AI models that are made accessible to the public under a free and open source licence that allows for the access, usage, modification, and distribution of the model, and whose parameters, including the weights, the information on the model architecture, and the information on model usage, are made publicly available (…).
Missing from these definitions is any reference to AI training datasets, which are key components of AI systems. Data is listed as one of the “free and open-source AI components” in recital 60i+1, but this definition does not necessarily relate to training datasets:
Free and open-source AI components cover the software and data, including models and general purpose AI models, tools, services or processes of an AI system.
Nevertheless, data is not included in the definitions that underpin the exemptions for open-source models. As a result, the AI Act provides no regulatory incentive to make training datasets available. This seems to contrast with the statement in recital 60i that AI released under free and open-source licenses “should be considered to ensure high levels of transparency and openness” — since the issue of datasets, a key component in AI development, is not addressed.
It is as if dataset access and transparency were taken for granted. The European legislators trusted open AI developers to act based on community norms. The release history of models that either are open source or claim to be so (at the risk of open-washing), such as Llama 2, Falcon, or Mistral, shows that no such norms exist. In each case, there is barely any information about the training datasets.
Although the AI Act does not acknowledge — due to the way it defines open-source AI — that transparency and access to datasets are a key element of open AI development, it nevertheless establishes rules that will apply to open-source AI.
These rules may be seen as controversial by an open source community used to the idea that it governs itself through community norms. This community’s expectations of legislation are defined, by and large, by exemptions and carve-outs, which indeed are often needed in order not to overly burden open source developers. Bert Hubert makes this point in a recent analysis of the Cyber Resilience Act, in which he argues that some of the new regulations are welcome:
I realize the above will not satisfy everyone in the open source world. Some feel that being open source should come with a blanket opt-out from any form of regulation.
Open Future, in its advocacy work on the AI Act, also assumed that self-governance and state regulation can go hand-in-hand and aim to achieve similar governance goals. This was expressed in a position paper on “Supporting open source and open science in the EU AI Act,” published in July 2023.
The paper proposed a proportional approach suited for different types of general purpose AI systems, but with a baseline that applies to all of them, including open-source AI:
Baseline requirements should apply to all foundation models that are put into service or made available on the market, and should ensure meaningful transparency, data governance, technical documentation, and risk assessment.
This proportional approach was adopted in the AI Act. Yet when it comes to dataset transparency, the overall standard is not ambitious — and open-source AI is largely exempt.
Although the AI Act sets (in Article 52) general transparency obligations for all AI models, these do not cover dataset transparency (or access). Providers of general purpose AI systems are obliged to prepare information on datasets and training and to share it with the AI Office upon request (Article 52c and Annex IXa). The scope of the information to be shared is broad:
information on the data used for training, testing and validation, where applicable, including type and provenance of data and curation methodologies (e.g. cleaning, filtering etc), the number of data points, their scope and main characteristics; how the data was obtained and selected as well as all other measures to detect the unsuitability of data sources and methods to detect identifiable biases, where applicable;
Yet this information will not be shared publicly — unless the AI Office decides to make it available. The Act does not clarify whether this will be possible. More importantly, general purpose AI systems are exempt from this requirement if they meet the definition of open-source AI.
A requirement to make information about datasets and training publicly available is introduced solely for high-risk AI systems, including open-source AI (Article 11(1) and Annex IV). Their technical documentation needs to include:
datasheets describing the training methodologies and techniques and the training data sets used, including a general description of these data sets, information about their provenance, scope and main characteristics; how the data was obtained and selected; labelling procedures (e.g. for supervised learning), data cleaning methodologies (e.g. outliers detection);
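To illustrate what such documentation might capture in practice, here is a minimal sketch of a dataset datasheet expressed as a Python data structure. All field names and example values are hypothetical; they do not reproduce any official template from the Act or from the “Datasheets for datasets” paper.

```python
# Minimal, hypothetical sketch of dataset documentation along the lines of
# Annex IV; field names and example values are illustrative only.
from dataclasses import dataclass, field


@dataclass
class DatasetDatasheet:
    name: str
    provenance: str               # where the data comes from
    scope: str                    # languages, domains, time span covered
    num_records: int              # number of data points
    collection_method: str        # how the data was obtained and selected
    labelling_procedure: str      # e.g. for supervised learning
    cleaning_methodology: str     # e.g. outlier detection, deduplication
    known_biases: list[str] = field(default_factory=list)


# Example entry (values are invented for illustration):
example = DatasetDatasheet(
    name="example-web-corpus",
    provenance="public web crawl snapshots, 2021-2023",
    scope="English-language web text",
    num_records=1_200_000_000,
    collection_method="URL filtering against a curated allowlist",
    labelling_procedure="none (self-supervised pre-training)",
    cleaning_methodology="deduplication and quality-classifier filtering",
    known_biases=["over-representation of US-centric sources"],
)
```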
Finally, relevant provisions for the transparency of datasets are included among copyright-related obligations for general purpose AI systems, which have been covered by Paul Keller in his recent opinion. These requirements also apply to open source AI. It remains to be seen whether these will cover all the elements that a state-of-the-art datasheet for datasets should include.
A relatively weak transparency mandate, in the case of datasets used in training open-source AI, is coupled with a declaration of faith in self-regulation. Overall, the Act encourages providers of general-purpose AI models to agree on codes of practice, through processes facilitated by the AI Office (recital 60s, article 52e).
And developers of open-source AI other than general purpose AI systems are encouraged (recital 57e) to:
implement widely adopted documentation practices, such as model cards and data sheets, as a way to accelerate information sharing along the AI value chain, allowing the promotion of trustworthy AI systems in the Union.
Overall, the AI Act does not introduce meaningful obligations for training data transparency, despite the fact that such transparency is crucial to the socially responsible development of what the Act defines as general purpose AI systems.
Over the coming months, the implementation of the AI Act will start, and the rules for dataset transparency will be further qualified. This is also a time when community conversations on open AI development will continue.
There is an urgent need to address this issue and set a clear standard for transparency with regard to AI training and access to training datasets. Today, models that claim to be open source, such as Llama 2, Falcon, or Mistral, are being released without providing access to their training datasets or even basic information about them.
Defining how openness can be ensured in the context of AI development hinges on an understanding of the complex relationship between the data used for training and the resulting AI models. Questions to be resolved include those on the status of models as potential derivatives of datasets, on the applicability of copyleft obligations, on the ways in which data is represented in the model, and on the extent to which training data is necessary to guarantee the freedoms that open source and free software frameworks were designed to ensure. The European policymakers avoided answering these questions. It is up to the ongoing community debates — both among open source development communities and among stewards of various datasets — to provide the necessary definitions and recommendations.