Making generative AI work for creators and the commons
This line of our work explores the consequences of the fact that machines can now consume human creativity, reassemble it, and spit out synthetic content that closely resembles the work previously produced by human creators.
The arrival of powerful generative machine learning models in 2022 raised important questions about their impact on creators and other rightholders. Will generative AI systems replace human creators? How will they affect the income of creators and other cultural producers? Do AI companies have the right to use copyright-protected works as training data for their models, and if so, under what conditions? And what does the emergence of generative AI tell us about the limits of copyright?
Our work in this area is guided by the objective of making AI work for both creators and the Digital Commons.
The blog post argues that, with increasing convergence on creator/rightholder opt-outs as an essential mechanism in the governance of generative AI models, there is an urgent need to standardize machine-readable opt-outs.
Ahead of this year's Creative Commons Summit in Mexico City, Open Future and Creative Commons hosted a one-day workshop to discuss the impact of generative AI on creators and the commons. The workshop explored how legal and regulatory contexts differ around the world and how this affects the development of shared strategies for dealing with the impact of generative AI on the commons and the position of creators.
Based on this discussion, and on subsequent conversations over the three days of the summit, the group identified a set of seven principles that could guide further work on creating an equitable framework for the regulation of generative AI around the world. These principles were published as part of a statement on "Making AI work for Creators and the Commons," posted on the Creative Commons blog on the final day of the Summit.
This Open Future policy brief examines the technical implementation of the EU law provision allowing authors and other rightholders to opt out of having their works used as training data for (generative) machine learning (ML) systems.
Today, OpenAI announced that GPTBot, the web crawler used to collect training data for its GPT series of large language models, can now be blocked via the robots.txt protocol. Site administrators can either disallow crawling of entire sites or create custom rules that allow `GPTBot` access to some parts of a site while blocking it from others. This functionality gives site owners a level of control over how their content is used by OpenAI's LLMs that they previously lacked.
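In practice, this takes the form of a rule group addressed to the `GPTBot` user agent in a site's robots.txt file. A minimal sketch (the directory names are illustrative, not part of OpenAI's documentation):

```
# Scope GPTBot: allow it into one section, keep it out of another
# (paths are illustrative):
User-agent: GPTBot
Allow: /blog/
Disallow: /archive/

# To block the crawler from the entire site, the single rule
# "Disallow: /" would be used instead.
```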
At first glance, OpenAI's approach follows the opt-out mechanism established by the TDM exceptions in the EU copyright framework. But on closer inspection, the model/vendor-specific nature of this approach raises more questions than it answers, as it implies that it is the responsibility of website publishers to set rules for each individual ML training crawler operating on the web, rather than setting default permissions that apply to all ML training crawlers.
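The vendor-specific nature of this mechanism can be seen with Python's standard-library robots.txt parser: a rule group addressed to `GPTBot` binds only that crawler, while any other ML training crawler falls through to the implicit default of unrestricted access. A small sketch (the robots.txt content, the `SomeOtherBot` user agent, and the URLs are hypothetical):

```python
from urllib import robotparser

# A hypothetical robots.txt that sets rules only for GPTBot;
# no "User-agent: *" group means other crawlers are unrestricted.
robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# GPTBot is blocked from the whole site...
print(rp.can_fetch("GPTBot", "https://example.com/essay.html"))  # False

# ...but a different ML training crawler is not bound by that rule
# and may still fetch the same page.
print(rp.can_fetch("SomeOtherBot", "https://example.com/essay.html"))  # True
```

This is the asymmetry the paragraph above describes: publishers must enumerate every crawler they wish to exclude, one rule group at a time, rather than expressing a single default that applies to all ML training crawlers.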
There has been a lot of attention on copyright and generative AI/ML over the last few months. In this essay, I propose a two-fold strategy to tackle this situation. First, it is essential to guarantee that individual creators can opt out of having their works used in AI training. Second, we should implement a levy that redirects a portion of the surplus from training AI on humanity's collective creativity back to the commons.
According to the announcement, 40,000+ individual artworks have been opted out from use for ML training via the haveibeentrained.com tool. The remaining 79 million+ opt-outs were registered through partnerships with platforms (such as ArtStation) and large rightholders (such as Shutterstock).
These opt-outs are for images included in the LAION 5B dataset used to train the Stable Diffusion text-to-image model. Stability AI has announced that the opt-outs collected by spawning.ai and made available via an API will be respected in the upcoming training of Stable Diffusion V3.
As we have previously argued, such opt-outs are supported by the EU's legal framework for machine learning, which allows rightholders to reserve the right to text and data mining carried out for all purposes except academic research undertaken by research institutions. Spawning.ai is the first large-scale initiative to leverage this framework to offer creators and other rightholders the ability to exclude their works from being used for machine learning training.
As generative machine learning (ML) becomes more widespread, the issue of copyright and ML input is back in focus. This post explores the EU legal framework governing the use of copyrighted works for training ML systems and the potential for collective action by artists and creators.