AI and Creative Labor

Making generative AI work for creators and the commons

This line of our work explores the consequences of the fact that machines can now consume human creativity, reassemble it, and produce synthetic content that closely resembles the output of human creators.

The arrival of powerful generative machine learning models in 2022 raised important questions about their impact on creators and other rightholders. Will generative AI systems replace human creators? How will they affect the income of creators and other cultural producers? Do AI companies have the right to use copyright-protected works as training data for their models, and if so, under what conditions? And what does the emergence of generative AI tell us about the limits of copyright?

Our work in this area is guided by the objective of making AI work for both creators and the Digital Commons.

Timeline

Spawning has released PD12M, a fully open dataset of 12.4 million image-caption pairs. It consists exclusively of public domain and CC0-licensed images obtained from Wikimedia Commons, a large number of cultural heritage organizations, and the iNaturalist website. From the paper accompanying the release:
We present Public Domain 12M (PD12M), a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.
The release of PD12M is remarkable not only for the size of the fully open dataset but also for the holistic approach that Spawning has taken. Via the Source.Plus platform, Spawning provides community-based governance mechanisms. In addition, the platform offers an exemplary level of transparency regarding the sources of the images included in the dataset. The release of PD12M is exciting not only because it builds on our ideas for a public data commons but also because Spawning sees the release of the dataset as a first step towards offering a foundational public domain image model, free of IP concerns, that will help artists fine-tune, and own, their own models on their own terms.
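
For those who want to explore the dataset, here is a minimal sketch of streaming the PD12M metadata with the Hugging Face `datasets` library. The dataset identifier `Spawning/PD12M` and the record fields are assumptions based on the release, not confirmed here; check the official distribution channels for the authoritative details.

```python
# Minimal sketch: streaming the PD12M metadata records.
# Assumption: the dataset is published on Hugging Face as "Spawning/PD12M";
# the field names mentioned below are illustrative, not confirmed.
from datasets import load_dataset

# Streaming avoids downloading metadata for all 12.4M pairs at once.
pd12m = load_dataset("Spawning/PD12M", split="train", streaming=True)

for record in pd12m.take(3):
    # Each record is expected to contain an image URL, a synthetic
    # caption, and provenance/licensing information.
    print(record)
```
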
Last week, the Data Provenance Initiative at MIT released a new paper by Shayne Longpre et al. that shows a dramatic increase in restrictions on the use of publicly available content as AI training data. This first large-scale longitudinal study of the restrictions placed on online content via robots.txt and terms of service shows that, over the past year, restrictions on the online content included in a number of commonly used AI training datasets have risen sharply. Whereas a year ago there were virtually no restrictions on use as training data, the researchers found that more than 28% of the most actively maintained, critical sources for the C4 training dataset are now completely restricted from use via robots.txt.

The paper documents a sharp increase in such restrictions starting in the fall of 2023, which coincides with the moment when OpenAI, Google, and others began documenting how robots.txt can be used to block their crawlers from ingesting publicly available content. The paper shows that the pushback against AI tools from content creators and website owners who object to their work being used for AI training purposes is not only real, but is becoming a major issue for AI companies that rely on publicly available online content as training data. The authors suggest that this will be particularly problematic for smaller companies and research projects that do not have the resources to license such content.
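
The kind of measurement the paper performs can be approximated with Python's standard library. The sketch below checks whether a domain's robots.txt fully blocks a few well-known AI crawlers; the agent list is our own illustrative choice, and the paper's methodology may differ.

```python
# Minimal sketch: checking whether a domain's robots.txt fully restricts
# well-known AI training crawlers (example agent list; the paper's own
# agent list and methodology may differ).
from urllib import robotparser

AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended"]

def restriction_status(domain: str) -> dict[str, bool]:
    """Map each crawler to True if it is blocked from the site root."""
    parser = robotparser.RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    parser.read()  # fetches and parses the robots.txt file
    return {
        bot: not parser.can_fetch(bot, f"https://{domain}/")
        for bot in AI_CRAWLERS
    }

print(restriction_status("example.com"))
```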

In Europe, where research uses are allowed under a mandatory exception to copyright that cannot be overridden by contract or technological measures, the negative impact on researchers is likely to be more limited than the authors fear. However, there are many other beneficial uses, such as search or web archiving, that will be affected by blanket restrictions via robots.txt and other means. In this context, the authors point to the need for better protocols, which is very much in line with our arguments for standardized rights holder opt-outs. From the concluding section of the paper:
The web needs better protocols to communicate intentions and consent. The [Robots Exclusion Protocol] places an immense burden on website owners to correctly anticipate all agents who may crawl their domain for undesired downstream use cases. We consistently find this leads to protocol implementations that don’t reflect intended consent. An alternative scheme might give website owners control over how their webpages are used rather than who can use them. This would involve standardizing a taxonomy that better represents downstream use cases, e.g. allowing domain owners to specify that web crawling only be used for search engines, or only for non-commercial AI, or only for AI that attributes outputs to their source data. New commands could also set extended restriction periods given dynamic sites may want to block crawlers for extended periods of time, e.g. for journalists to protect their data freshness. Ultimately, a new protocol should lead to website owners having greater capacity to self-sort consensual from non-consensual uses, implementing machine-readable instructions that approximate the natural language instructions in their Terms of Service.
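
To make the authors' proposal more tangible, here is a purely hypothetical sketch of what such use-case-based directives could look like. None of these directives exist in the actual Robots Exclusion Protocol; they only illustrate the kind of taxonomy the paper argues for.

```
# Hypothetical sketch only: these purpose-based directives are NOT part of
# the Robots Exclusion Protocol; they illustrate the proposed taxonomy.
User-agent: *
Allow-purpose: search-indexing
Allow-purpose: noncommercial-ai-training
Disallow-purpose: commercial-ai-training
# e.g. a news site protecting the freshness of its data:
Restriction-period: 30d
```
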
Both the New York Times and 404 Media have published articles that go into more detail on the paper.
On Wednesday, the French competition authority fined Google 250 million euros for failing to inform news publishers about the use of their content to train a generative AI system. Technically, the fine is a penalty for failing to comply with commitments that Google made to news publishers in a 2021 settlement monitored by the authority. What makes the decision noteworthy is that it is the first documented case in which the use of copyrighted works to train AI systems has been challenged under the EU's copyright framework. While there are numerous pending cases pitting creators and other rights holders against generative AI companies, almost all of them have been filed outside the EU. As we have pointed out before, the main shortcoming of the EU approach to regulating the use of copyrighted works for AI training is the lack of standardized ways for creators and other rights holders to opt out. This aspect also played a key role in the competition authority's decision, which states:
Furthermore, until at least September 28, 2023 and the launch of its "Google Extended" tool, Google did not offer a technical solution enabling publishers and press agencies to oppose the use of their content by Bard without affecting the display of this content on other Google services. Indeed, until now, publishers and press agencies wishing to oppose such use had to insert an instruction opposing any indexing of their content by Google, including on the Search, Discover and Google News services, which were precisely the subject of negotiation for the remuneration of neighboring rights. In the future, the Autorité will be particularly attentive to the effectiveness of the opt-out mechanisms put in place by Google.
This once more underlines the urgent need to implement standards for rights holder opt-outs that are efficient, flexible, and scalable.
Last week, the Polish government published its much-delayed proposal for implementing the Copyright Directive. The proposal contained a big surprise: the Polish government proposes to add language to the text and data mining (TDM) exceptions asserting that “reproduction of works for text and data mining cannot be used to create generative AI models.” Paul has published an analysis of the proposal on the Kluwer Copyright Blog, in which he argues that such a limitation is not only non-compliant with the provisions of the CDSM Directive but is also based on flawed assumptions and would result in a legal mess:
At this point, it seems useful to recall the key balances inherent in the EU’s regulatory framework for the use of copyrighted works in AI training. They form the basis of claims by the Commission and others that the EU has a uniquely balanced approach to this thorny issue. Taken together, the TDM provisions address four key concerns: (1) they limit permission to use copyrighted works for training data to those works that are lawfully accessible; (2) they privilege non-profit scientific research; (3) they ensure that creators and other rights holders can exclude their works from being used to train generative AI systems; and (4) they ensure that works that are not actively managed by their rights holders can be used to train AI models. Excluding the training of generative AI from this balanced arrangement may please some creators and rights holders, but it also pushes AI back into a legal gray area. It also seems incompatible with the provisions of the AI Act, which situates the training of generative AI models within the broader concept of TDM, and which will be directly applicable in Poland.
Expanding on this analysis, we have also submitted a contribution (PL|EN) to the public consultation launched by the Polish Ministry of Culture and National Heritage that argues for an implementation in line with the directive and suggests that Polish lawmakers should instead focus on enabling a fair remuneration for creators who opt out of TDM and ensuring the sustainability of public information resources.
In a recent article titled 'Generative AI Has a Visual Plagiarism Problem', Gary Marcus and Reid Southen provide further evidence of the ability of generative AI models to reproduce remarkably similar versions of works in their training data. They show that, in response to generic prompts, the latest versions of Midjourney and DALL-E return images that closely resemble frames from popular movies or contain copyrighted characters. This discovery raises a number of interesting questions about the ability of these models to infringe copyright - seemingly on their own. The article is also notable for a quote from David Holz, founder and CEO of Midjourney, in response to the question of whether Midjourney seeks permission from copyright holders. His answer:
No. There isn’t really a way to get a hundred million images and know where they’re coming from. It would be cool if images had metadata embedded in them about the copyright owner or something. But that’s not a thing; there’s not a registry. There’s no way to find a picture on the Internet, and then automatically trace it to an owner and then have any way of doing anything to authenticate it.
While this response sounds dismissive in the context of the article (a similar statement made by OpenAI to the House of Lords was also criticized as dismissive), Holz does have a point. There is indeed an urgent need for better copyright information infrastructures that allow AI model developers and others to automatically assess the copyright status of works - and to clear rights. This is something we pointed out in our recent policy paper on best practices for opting out of ML training and in an earlier white paper on a public repository of public domain and openly licensed works.
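
That said, embedded rights metadata is not entirely absent today: image files can carry an EXIF Copyright field, even if it is rarely present or reliable at web scale. A minimal sketch of reading that field with the Pillow library:

```python
# Minimal sketch: reading the EXIF "Copyright" field from an image file,
# where present. In practice this metadata is missing from or stripped
# out of most images circulating on the web, which is Holz's point.
from PIL import Image
from PIL.ExifTags import TAGS

def embedded_copyright(path: str) -> str | None:
    """Return the EXIF Copyright string, or None if the image has none."""
    exif = Image.open(path).getexif()
    for tag_id, value in exif.items():
        if TAGS.get(tag_id) == "Copyright":
            return str(value)
    return None

print(embedded_copyright("artwork.jpg"))  # e.g. "© 2024 Jane Artist" or None
```
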
At this year's Blender Conference, Paul gave a talk on AI, the commons, and the limits of copyright. The talk revisits some of the arguments made in an earlier blog post with the same title and combines them with the seven recommendations for making AI work for creators and the commons that we developed with other participants of this year's Creative Commons Summit. A recording of Paul's talk is available on the Blender YouTube channel.

Ahead of this year's Creative Commons Summit in Mexico City, Open Future and Creative Commons hosted a one-day workshop to discuss the impact of generative AI on creators and the commons. The workshop explored how legal and regulatory contexts differ around the world and how this affects the development of shared strategies for dealing with the impact of generative AI on the commons and the position of creators.

Based on this discussion, and on subsequent conversations over the three days of the summit, the group identified a set of seven principles that could guide further work on creating an equitable framework for the regulation of generative AI around the world. These principles were released as part of a statement on "Making AI work for Creators and the Commons," published on the Creative Commons blog on the final day of the Summit.

Today, OpenAI announced that GPTBot, the web crawler used to collect training data for its GPT series of large language models, can now be blocked via the robots.txt protocol. Site administrators can either disallow crawling of entire sites or create custom rules that allow `GPTBot` access to some parts of a site while blocking it from others. This functionality gives site owners a level of control over how their content is used by OpenAI's LLMs that they previously lacked.
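
Based on the announcement, blocking works through standard robots.txt rules; the directory paths in the second variant below are placeholders, not paths from OpenAI's documentation:

```
# Block GPTBot from an entire site:
User-agent: GPTBot
Disallow: /

# Alternatively, allow some paths while blocking others
# (directory names are placeholders):
User-agent: GPTBot
Allow: /blog/
Disallow: /members/
```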

At first glance, OpenAI's approach follows the opt-out mechanism established by the TDM exceptions in the EU copyright framework. But on closer inspection, the model/vendor-specific nature of this approach raises more questions than it answers, as it implies that it is the responsibility of website publishers to set rules for each individual ML training crawler operating on the web, rather than setting default permissions that apply to all ML training crawlers.
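
Concretely, under this per-vendor model a site owner who wants to stay out of ML training entirely has to enumerate every training crawler themselves, along the lines of the following sketch (an illustrative and inevitably incomplete list of agents):

```
# One group per vendor-specific crawler; any crawler not listed
# here is, by default, not blocked.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```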

According to the announcement, 40,000+ individual artworks have been opted out of use for ML training via the haveibeentrained.com tool. The remaining 79 million+ opt-outs were registered through partnerships with platforms (such as ArtStation) and large rights holders (such as Shutterstock).

These opt-outs are for images included in the LAION 5B dataset used to train the Stable Diffusion text-to-image model. Stability AI has announced that the opt-outs collected by spawning.ai and made available via an API will be respected in the upcoming training of Stable Diffusion V3.
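
From a model trainer's perspective, honoring these opt-outs amounts to filtering the training set against the opt-out registry before ingestion. The sketch below illustrates the idea; the endpoint URL and response format are hypothetical placeholders, not Spawning's actual API.

```python
# Minimal sketch: dropping opted-out works from a training set before
# ingestion. The endpoint URL and response shape are hypothetical
# placeholders, not Spawning's actual API.
import requests

OPT_OUT_API = "https://api.example-optouts.org/check"  # hypothetical

def remove_opted_out(image_urls: list[str]) -> list[str]:
    """Return only the URLs whose rights holders have not opted out."""
    response = requests.post(OPT_OUT_API, json={"urls": image_urls}, timeout=30)
    response.raise_for_status()
    flags = response.json()["opted_out"]  # hypothetical: one bool per URL
    return [url for url, opted_out in zip(image_urls, flags) if not opted_out]

training_urls = remove_opted_out([
    "https://example.com/image-1.jpg",
    "https://example.com/image-2.jpg",
])
```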

As we have previously argued, such opt-outs are supported by the EU's legal framework for machine learning, which allows rights holders to reserve their rights over text and data mining carried out for all purposes except scientific research undertaken by research institutions. Spawning.ai is the first large-scale initiative to leverage this framework to offer creators and other rights holders the ability to exclude their works from being used for machine learning training.
