Considerations for implementing rightholder opt-outs by AI model developers

This policy brief builds on our previous policy brief on this topic and develops its ideas in light of the copyright provisions of the AI Act. Article 53(1)(c) of the AI Act requires providers of general-purpose AI models to put in place a policy to comply with EU copyright law and, in particular, with machine-readable opt-outs from the text and data mining (TDM) exception. This new Open Future policy brief explores what such a compliance policy might look like in practice. It provides an overview of the technical standards and services that are available to implement rightholders’ opt-outs in a way that is effective, scalable, and able to meet the needs of both rightholders and AI model developers.

The brief argues that to achieve this goal, four different aspects of machine-readable opt-outs require further attention: the identifiers for works, the vocabulary for opt-outs, the infrastructure used to communicate and respect opt-outs, and the effect of an opt-out once it has been recorded. For each of these four areas, there is a need to build consensus and converge on solutions that work for all stakeholders.

With regard to identifiers, the policy brief highlights that there are currently two dominant approaches: so-called location-based identifiers (such as robots.txt) and unit-based identifiers (in the form of metadata embedded in or associated with files). Both have advantages and disadvantages. The brief argues that for opt-out compliance policies to be effective, both approaches need to be considered, and that agreement on a limited number of standardized identifiers is desirable from the perspective of both rightholders and AI model trainers.
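To make the distinction concrete, the following minimal sketch checks both kinds of signal for a single file. It uses Python's standard `urllib.robotparser` for the location-based check. The crawler name `ExampleAIBot`, the metadata key `tdm_opt_out`, and the helper names are assumptions made for illustration and are not taken from the brief or from any existing standard.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "ExampleAIBot" is an invented crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def location_opt_out(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules disallow `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def unit_opt_out(metadata: dict) -> bool:
    """Return True if metadata attached to a file carries an opt-out flag.

    The key "tdm_opt_out" is a placeholder, not a field from a real standard.
    """
    return bool(metadata.get("tdm_opt_out", False))


def is_opted_out(robots_txt: str, user_agent: str, url: str, metadata: dict) -> bool:
    """Treat a work as opted out if either kind of signal says so."""
    return location_opt_out(robots_txt, user_agent, url) or unit_opt_out(metadata)
```

In this sketch a crawler that only consulted robots.txt would miss an opt-out carried in the file's metadata, and vice versa, which is one way to read the brief's point that both approaches need to be considered.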

Secondly, the policy brief points out that there is currently no consistent vocabulary for expressing the scope of opt-outs. Rightholders have made it clear that while they want to be able to opt out of the training of generative AI models, they do not wish to opt out of their works being used by other forms of AI, especially when such technologies are used for search and discovery of their content. This means that there is a need to develop and agree on a vocabulary (or taxonomy) of uses from which rightholders can opt out, one that is more granular than the binary choice between opting out of all AI and declaring no opt-out. In the policy brief, we argue that compliance policies should be based on a vocabulary that distinguishes between a full TDM opt-out and a more limited opt-out from training generative AI models.
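A two-level vocabulary of this kind could be modeled roughly as follows. This is only a sketch: the enum members, their string values, and the function `is_use_permitted` are hypothetical names chosen for illustration, and it assumes that search and discovery counts as a TDM use that only a full opt-out would block.

```python
from enum import Enum


class OptOut(Enum):
    """Hypothetical opt-out scopes a rightholder could declare."""
    NONE = "none"                              # no opt-out declared
    GENERATIVE_AI_TRAINING = "genai-training"  # limited opt-out
    ALL_TDM = "all-tdm"                        # full TDM opt-out


class Use(Enum):
    """Hypothetical uses a developer might want to make of a work."""
    SEARCH_AND_DISCOVERY = "search-and-discovery"
    GENERATIVE_AI_TRAINING = "genai-training"


def is_use_permitted(opt_out: OptOut, use: Use) -> bool:
    """Decide whether `use` is compatible with the declared opt-out scope."""
    if opt_out is OptOut.ALL_TDM:
        # A full TDM opt-out blocks every TDM-based use in this model.
        return False
    if opt_out is OptOut.GENERATIVE_AI_TRAINING:
        # The limited opt-out blocks only generative AI training.
        return use is not Use.GENERATIVE_AI_TRAINING
    return True
```

Under these assumptions, a rightholder who declares the limited scope keeps their works available for search and discovery while withholding them from generative AI training.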

Regarding infrastructure, the brief notes that location-based and some unit-based identification schemes do not require dedicated infrastructure. However, unit-based approaches that rely on content-based identifiers require some form of registry where opt-outs are recorded. A registry (or federation of registries) would allow rightholders to record opt-outs and allow model trainers to check the registry for known opt-outs. The policy brief argues that there is a need for a public registry infrastructure based on standardized identifiers (such as ISCC codes).
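The record-and-check flow of such a registry can be sketched in a few lines. The class `OptOutRegistry` and its methods are hypothetical, and the identifier used here is a plain SHA-256 digest standing in for a standardized content-based identifier such as an ISCC code. It is not an ISCC implementation, and a SHA-256 digest only matches byte-identical copies.

```python
import hashlib
from typing import Optional


def content_id(data: bytes) -> str:
    """Derive an identifier from the content itself.

    Stand-in only: SHA-256 is used for illustration, not an ISCC code.
    """
    return hashlib.sha256(data).hexdigest()


class OptOutRegistry:
    """In-memory stand-in for a public opt-out registry (hypothetical API)."""

    def __init__(self) -> None:
        self._records = {}  # maps content identifier -> opt-out scope

    def record(self, data: bytes, scope: str) -> str:
        """Rightholder side: record an opt-out and return the identifier."""
        identifier = content_id(data)
        self._records[identifier] = scope
        return identifier

    def lookup(self, data: bytes) -> Optional[str]:
        """Model trainer side: return the recorded scope, or None if unknown."""
        return self._records.get(content_id(data))
```

Because the identifier is computed from the content, a model trainer can check a file found anywhere on the web without knowing where the rightholder originally published it.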

Finally, the brief looks at the effects of opt-outs. At the simplest level, it seems clear that opted-out works may not be added to the training data used to train new generative AI models. It is equally clear that opt-outs can only apply to training that occurs after an opt-out request is received. However, given the scale of AI training data collection, training datasets are likely to contain multiple copies of the same work. Here, the policy brief argues that model trainers should also make efforts to identify other instances of the opted-out work in the data they use to train future models.
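One simple way to look for other instances of an opted-out text work is to compare normalized fingerprints before assembling the data for a future training run. The sketch below is an illustration under stated assumptions, not a method described in the brief: the function names are hypothetical, and the normalization (collapsing whitespace and lowercasing) only catches trivially different copies.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_for_future_training(documents, opted_out_texts):
    """Drop every document whose fingerprint matches an opted-out work."""
    blocked = {fingerprint(text) for text in opted_out_texts}
    return [doc for doc in documents if fingerprint(doc) not in blocked]


corpus = [
    "The  Quick brown fox.\n",   # copy with different spacing and case
    "the quick brown fox.",      # second copy from another source
    "An unrelated text.",
]
remaining = filter_for_future_training(corpus, ["The quick brown fox."])
```

A real pipeline would need more robust matching, for example to catch excerpts, translations, or re-encoded media, which is where similarity-aware identifiers become relevant.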

The policy brief concludes by outlining a number of next steps for developing a framework for opt-out compliance. The most obvious first step is to define a common vocabulary for opt-outs that is supported by rightholders and AI model trainers, and to develop generic user agents for robots.txt based on this vocabulary that allow rightholders to horizontally opt out of AI training. At the same time, all stakeholders should explore options for building a public registry infrastructure for unit-based opt-outs based on standardized identifiers.
