Last week the AI Office presented the first outline of the template for a summary of training data. GPAI model providers will need to use it to publicly document the data they use to train their models once the relevant provisions of the AI Act come into effect in August this year. At the same time, the AI Office asked stakeholders to provide written feedback on the outline by Friday, the 31st of January, at 12:00 CET.
While we are pleased to see several of the suggestions that we made in our blueprint for the template reflected in the outline, we also see a number of issues where the approach proposed by the AI Office falls short of its potential to enable parties with legitimate interests to exercise their rights under EU law.
Based on our previous analysis, together with the Mozilla Foundation, we have prepared the following feedback for the AI Office (note that the AI Office indicated that responses should be limited to 500 words per question):
The approach proposed by the AI Office requests that providers submit information about the model’s placement on the market and its knowledge cut-off date. In Section 2, the approach asks for the “period of collection,” but only for datasets that exceed a certain threshold.
Information about the date range of the training data is important from the perspective of model reliability and upholding consumer protection. However, without proper granularity, this information is of limited use. A broad and partial date range makes it more difficult to identify potential reliability issues or biases, or to assess the suitability of the model for a specific task. We recommend, therefore, that the template ask for the date range of the training data for each data source.
Further in this section, different types of metrics are used to report on the overall size of the different types/modalities of the data (for text – tokens or bytes, for images – number of images, for video – number of minutes, for audio – number of minutes). We appreciate the granularity of the approach presented by the AI Office in terms of how providers are meant to report on the types/modalities of the training data and their characteristics. We recommend including the size in GB as well as the number of tokens, where applicable, as complementary metrics for all modalities, so that reporting is comparable across modalities and uses consistent units of measurement. This should be reported at the level of each dataset.
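To make this recommendation concrete, the sketch below shows what a dataset-level size record could look like if a common unit (GB) were reported alongside the modality-specific metric. The field names and figures are our own illustrative assumptions, not part of the AI Office’s outline.

```python
# Purely illustrative: a possible per-dataset size record that pairs a common
# unit (GB) with the modality-specific metric. Field names and numbers are
# invented for this example and are not taken from the AI Office's outline.

example_dataset_entry = {
    "dataset_name": "example-web-text-corpus",            # hypothetical dataset
    "modality": "text",
    "size_gb": 850,                                        # common unit across all modalities
    "size_modality_metric": {"tokens": 210_000_000_000},   # tokens for text; minutes for audio/video
    "collection_period": "2019-01 to 2023-06",             # per-dataset date range, as recommended above
}

print(f'{example_dataset_entry["size_gb"]} GB, '
      f'{example_dataset_entry["size_modality_metric"]["tokens"]:,} tokens')
```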
In addition, information on the number of tokens must be accompanied by information about the tokenization process, in the form of links to the tokenizer, if publicly accessible, or a description of how the data is tokenized. Without this information, a token count on its own provides limited value, because the tokenization process affects the reported size and characteristics of the dataset. Different tokenization methods (e.g., in the case of text, word-level, subword-level, or character-level) produce different token counts, which affects comparability across models and providers. Moreover, tokenization decisions can introduce linguistic and cultural bias, such as misrepresenting certain languages, so it is important to disclose both the methodology and the token count to ensure transparency.
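To illustrate why the tokenization method matters, the minimal sketch below counts “tokens” for the same sentence under three naive schemes. Real providers use trained subword tokenizers (such as BPE), so the exact numbers here are only indicative, but the point stands: the reported size depends on how you count.

```python
# Minimal illustration of how the tokenization level changes the reported
# "number of tokens" for the same text. These are naive splitting rules chosen
# for clarity; production tokenizers are trained subword models (e.g. BPE)
# whose vocabularies differ per provider.

text = "Transparency requirements apply to general-purpose AI models."

word_tokens = text.split()                                        # word-level: split on whitespace
char_tokens = list(text)                                          # character-level: one token per character
subword_tokens = [text[i:i + 4] for i in range(0, len(text), 4)]  # crude stand-in for a subword scheme

print(f"word-level:      {len(word_tokens)} tokens")
print(f"subword (naive): {len(subword_tokens)} tokens")
print(f"character-level: {len(char_tokens)} tokens")
```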
Finally, this section also asks the providers to submit a “description of the linguistic, regional, demographic and other relevant characteristics of the overall training data.” While this information is important from the point of view of the representativeness of the training data and has consequences for equal treatment, without knowing the stage at which a given dataset was used (e.g., pre-training or fine-tuning), it is difficult to assess its impact on the model’s behavior and performance. We recommend, therefore, that the template ask for specific information on the steps taken to ensure diversity and representativeness of training data across relevant categories (e.g., demographics, languages) at the different stages of model training.
The AI Office’s approach does not ask the GPAI model providers to publish a list of all datasets used to train the model. Instead, for each data source it requests the disclosure of a subset of “main/large datasets” that exceed a certain threshold.
The lack of access to a complete list of datasets diminishes the ability of parties with legitimate interests to exercise their rights under EU law, including the protection of copyright and personal data. As a result, this approach would lead to reduced transparency and accountability of GPAI model providers. Moreover, a threshold-based reporting obligation can be easily circumvented if it cannot be validated against a full list of all datasets used. For instance, a single large dataset, such as CommonCrawl, could be split into smaller datasets by language or top-level domain (like .be and .com), allowing providers to avoid disclosure under the threshold.
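To make this circumvention risk tangible, the sketch below works through the splitting scenario with invented figures: reported as a single dataset, the corpus clearly crosses a 5% share threshold, but split into thirty slices by language or top-level domain, none of the slices does.

```python
# Hypothetical numbers only, to show how a per-dataset disclosure threshold
# could be circumvented by splitting one large corpus into smaller slices.
# The 5% figure mirrors the threshold referenced in sections 2.1 and 2.2.

total_training_data_gb = 10_000      # assumed overall size of the training data
threshold_share = 0.05               # disclosure threshold: 5% of the total

# One web-scale corpus reported as a single dataset...
single_corpus_gb = 3_000
print(single_corpus_gb / total_training_data_gb >= threshold_share)
# True -> would have to be disclosed

# ...versus the same corpus split into 30 slices (e.g. by language or TLD).
slices_gb = [single_corpus_gb / 30] * 30
print(any(s / total_training_data_gb >= threshold_share for s in slices_gb))
# False -> none of the slices would have to be disclosed
```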
Another concern with a dataset disclosure threshold is that it would favor certain types/modalities of data. For example, text datasets are likely to be smaller in size than video or image datasets, even when they come from the same source. As a result, a single threshold per data source could lead to a skewed representation of what is disclosed, with text-based records being underreported compared to visual or multimedia records. To address concerns about skewed disclosure of datasets due to different types/modalities, the AI Office should implement a reporting mechanism that accounts for these differences. We recommend, therefore, that the template ask for a list of all datasets used to train the model. At the very least, the thresholds under sections 2.1 and 2.2 should be set significantly lower, as datasets that do not cross the stipulated 5% threshold are still likely to be of significant size and relevance to parties with legitimate interests.
The ability to exercise rights under EU law depends on having access to different types of information about the content used to train GPAI models, including information about the (pre-)processing steps. Processing decisions alter the composition of the training data and affect the outputs the models produce, as well as their risk profiles. As a result, information about these decisions is relevant for assessing whether a party has a right that can be exercised and enforced under EU law. For example, information about the anonymization techniques implemented by the GPAI model provider, coupled with knowledge of the data sources and the data collection cut-offs, would enable data subjects to understand whether their personal data may have been used in the context of GPAI model training and whether their rights under EU law may have been interfered with. This information about data processing would allow data subjects to better exercise their rights under the GDPR, and would enable privacy and security experts to scrutinize the security of a GPAI model or the privacy claims made by its developers, which is precisely the purpose the transparency template is intended to serve. Similarly, further information on the filtering methods can help assess whether and how model providers seek to prevent their models from generating harmful or discriminatory outputs, which could support the work of consumer protection groups and fundamental rights bodies, and aid affected individuals more broadly. This information can thus support the legitimate interest of preventing discrimination and respecting cultural diversity.
We welcome the measures presented in the AI Office’s approach related to the respect of copyright and related rights. However, for the reasons outlined above, information about data processing must not be limited to supporting the protection of copyright. There are other rights at play here, too, the protection and exercise of which may require further information to be provided.
Further, the approach allows GPAI model providers to define for themselves what they consider “unwanted” in the training data, and it does not explicitly address anonymization or other privacy-related (pre-)processing measures. This leaves too much discretion to the model providers. If implemented as proposed, it will not provide sufficient legal certainty for rightsholders, data subjects, or other parties with legitimate interests, such as researchers or consumer organizations, as to whether their rights are respected by providers of GPAI models.
If the template does not include sufficient technical detail on how the training data is (pre-)processed, it may become a performative exercise and an unnecessary bureaucratic burden on companies, without providing value to relevant interest holders. To truly enable people to protect their legitimate interests, the template must request sufficient detail (please see section 4 of the Blueprint of the template for the summary of content used to train general-purpose AI models (Article 53(1)d AIA) – v.2.0).