Defining best practices for opting out of ML training

This Open Future policy brief examines the technical implementation of the EU law provision allowing authors and other rightholders to opt out of having their works used as training data for (generative) machine learning (ML) systems.

With the adoption of the Copyright in the Digital Single Market (CDSM) Directive in 2019, the European Union has established a regulatory framework for the use of copyrighted works as training data for machine learning: Articles 3 and 4 of the CDSM Directive introduced copyright exceptions for text and data mining (TDM), which authorize the types of reproductions made in the context of training ML models on publicly available copyrighted works. Together, the two articles provide a clear legal framework: research organisations and cultural heritage institutions may freely use lawfully accessible works for ML training in the context of scientific research (Art. 3), while everyone else may only do so if the rightholders have not reserved their use (Art. 4).

This EU framework is unique in the world because it respects the rights of creators to exclude their works from ML training data. This addresses concerns voiced by creators and other rightholders about the impact of machine learning on the creative process and their income while providing legal clarity to ML researchers and developers regarding the use of publicly available information to train their models.

However, it is unclear how opt-outs from ML training based on the machine-readable reservation of rights provided for in Article 4 will work in practice, as there are currently no generally recognized standards or protocols for expressing such a reservation in machine-readable form.

While there are several potential solutions that allow creators and other rightholders to communicate their rights reservations in a machine-readable format, it is unclear whether and how opt-outs expressed through these tools will be respected by ML model developers. As a result, creators and rightholders face significant uncertainty about the practical benefit of investing in any of these tools.
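To illustrate what a machine-readable reservation could look like in practice, the sketch below reuses the long-standing robots.txt convention. This is only one conceivable approach and is our assumption for illustration: the brief does not name it as a recognized standard, and whether such a signal would satisfy Article 4(3) is not settled. The crawler name `ExampleMLBot`, the helper `has_reserved_rights` and the example URL are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a rightholder. "ExampleMLBot" is an
# invented crawler name used only for illustration.
ROBOTS_TXT = """\
User-agent: ExampleMLBot
Disallow: /

User-agent: *
Allow: /
"""


def has_reserved_rights(robots_txt: str, crawler: str, url: str) -> bool:
    """Return True if the robots.txt rules disallow `crawler` from fetching `url`.

    Treating a disallow rule as a rights reservation is an assumption of this
    sketch, not an established legal or technical standard.
    """
    parser = RobotFileParser()
    # Parse the rules from memory instead of fetching them over the network.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(crawler, url)


if __name__ == "__main__":
    page = "https://example.org/artwork.html"
    for bot in ("ExampleMLBot", "OtherBot"):
        print(bot, "opted out:", has_reserved_rights(ROBOTS_TXT, bot, page))
```

A signal of this kind only has practical effect if ML developers check for it before collecting training data, which is the demand-side uncertainty described above.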

This ongoing lack of clarity about how the opt-out under Article 4 of the CDSM Directive can be used creates a risk that the balanced regulatory approach adopted by the EU in 2019 will not work in practice, which would likely lead to a reopening of substantive copyright legislation during the next mandate.

In this situation, there is a growing recognition among stakeholders of the need to identify best practices for the communication of opt-outs under Article 4 of the CDSM Directive. Such best practices need to address both the supply side (providing certainty to creators and rightholders on how to express opt-outs) and the demand side (incentivizing entities developing ML models to respect opt-outs).

In this brief, we argue that this requires the intervention of an actor with sufficient credibility to provide guidance on how to express machine-readable rights reservations. In the current constellation, the entity best placed to take on this role is the European Commission, which is responsible for ensuring the implementation of the provisions of the CDSM Directive.

In the short term, the Commission should publicly identify data sources, protocols and standards that are freely available, have publicly documented functionality, and allow authors and rightholders to express a machine-readable rights reservation in accordance with Article 4(3) of the CDSM Directive. Such an intervention would provide guidance to creators and other rightholders seeking means to opt out of ML training, and it would provide more certainty to ML developers seeking to understand what constitutes best efforts to comply with their obligations under Article 4(3) of the CDSM Directive.

Over time, this approach should be superseded by the emergence of a robust standard maintained independently of any direct stakeholders. It will be critical that both the standard and the technical and organizational infrastructures that support it are managed as a public good that is trusted by all relevant stakeholders.
