AI and Creative Labor

Licensing, Levies, and the Limits of Copyright

July 14, 2025

MEP Axel Voss questions his own copyright framework while offering insights on remuneration and transparency for AI.

June 30, 2025

Beyond AI and copyright

This White Paper calls for a levy on commercial AI systems to fund public infrastructures and ensure a sustainable Digital Knowledge Commons in the generative AI era.

May 6, 2025

IETF working group will further develop our proposal for an opt-out vocabulary

The IETF AI Preferences Working Group has adopted Open Future's opt-out vocabulary proposal as their starting point for developing international standards for expressing AI preferences on the open internet.

March 24, 2025

Analysis

Is Web Scraping the Only Copyright Concern for AI?

The EU's AI Code of Practice has a blind spot: it limits copyright compliance requirements to web crawling. This narrow focus ignores other data collection methods, such as torrenting, potentially creating loopholes in AI training data regulations.

March 7, 2025

A vocabulary for opting out of AI training and other forms of TDM

This proposal presents an opt-out vocabulary for AI training and text mining, based on stakeholder discussions to help creators better control how their works are used.

February 26, 2025

Open Future’s Response to the UK Consultation on Copyright and AI

We have submitted our response to the UK government's consultation on Copyright and AI.

January 29, 2025

Our feedback on the first outline for the AI training data template

Our feedback on the first outline of the template for a summary of training data, which was presented by the AI Office in January.

December 17, 2024

The UK government proposes to reinvent the wheel

Today, the UK government launched a consultation on copyright and artificial intelligence. In the consultation, the UK government essentially signals its intention to adopt the EU approach to the use of copyrighted works for the purpose of training AI models, as embodied in the commercial TDM exception in Article 4 of the Copyright in the Digital Single Market Directive (which was adopted with UK support before Brexit but never implemented after Brexit). In addition, the UK government also proposes to introduce a version of the training data transparency provisions that mirror the obligation in Article 53(1)(d) of the EU AI Act:

This consultation seeks views on how we can deliver a solution that achieves our key objectives for the AI sector and creative industries. These objectives are:

Supporting right holders’ control of their content and ability to be remunerated for its use.

Supporting the development of world-leading AI models in the UK by ensuring wide and lawful access to high-quality data.

Promoting greater trust and transparency between the sectors.

Our aim is to deliver these objectives through a package of interventions, to be considered together, that addresses the needs of both sectors, providing clarity and transparency. The proposals include a mechanism for right holders to reserve their rights, enabling them to license and be paid for the use of their work in AI training. Alongside this, we propose an exception to support the use at scale of a wide range of material by AI developers where rights have not been reserved. This approach would balance right holders’ ability to seek remuneration while providing a clear legal basis for AI training with copyright material, so that developers can train leading models in the UK while respecting the rights of right holders. For this approach to work, greater transparency from AI developers is a prerequisite—transparency about the material they use to train models, how they acquire it, and about the content generated by their models. This is vital to strengthening trust, and we are seeking views on how best to deliver it.

The deadline for responses to the consultation is the 25th of February 2025 and responses can be submitted here.

November 20, 2024

event

From Code to Conduct: Insights from a Mozilla Morning

The Mozilla Foundation and Open Future co-hosted an event with policymakers, industry representatives, and civil society to explore how to make content used by AI more transparent.

November 4, 2024

PD12M: a fully open image training dataset with community governance

Spawning has released PD12M, a fully open dataset consisting of 12.4 million image-caption pairs. The dataset exclusively consists of public domain and CC0 licensed images that have been obtained from Wikimedia Commons, a large number of cultural heritage organizations, and the iNaturalist website. From the paper accompanying the release:

We present Public Domain 12M (PD12M), a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.

The release of PD12M is remarkable not only given the size of the fully open dataset but also because of the holistic approach that Spawning has taken. Via the source.plus platform, Spawning provides community-based governance mechanisms. In addition, the platform also provides an exemplary level of transparency regarding the sources of the images included in the dataset. The release of PD12M is exciting not only because it builds on our ideas for a public data commons but also because Spawning sees the release of the dataset as a first step towards offering a foundational public domain image model with no IP concerns, that will help artists to fine-tune, and own, their own models on their own terms.

October 29, 2024

Museums and AI: Balancing Innovation and Integrity

This video provocation, presented at the European Heritage Hub Forum, focuses on AI systems' implications for the role of cultural heritage institutions.

October 10, 2024

Analysis

LAION vs Kneschke

The Landgericht Hamburg's decision to allow LAION to include a photographer's image in the LAION-5B training dataset empowers non-profit providers of public training datasets, which play a critical role in making AI training more transparent.

July 24, 2024

Analysis

Machine readable or not?

Observations on the hearing of the first court case dealing with the use of copyrighted works as AI training data in the context of the European Union's copyright framework.

July 22, 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons

Last week, the Data Provenance Initiative at MIT released a new paper by Shayne Longpre et al. that shows a dramatic increase in restrictions on the use of publicly available content as AI training data. This first large-scale longitudinal study of the restrictions placed on online content via robots.txt and terms of service shows that over the past year, restrictions on online content included in a number of commonly used AI training datasets have increased dramatically. From virtually no restrictions on use as training data just a year ago, the researchers found that more than 28% of the most actively maintained, critical sources for the C4 training dataset are now completely restricted from use via robots.txt.

The paper documents a sharp increase in such restrictions starting in the fall of 2023, which coincides with the time when OpenAI, Google, and others began documenting the ability to use robots.txt to block their crawlers from ingesting publicly available content. The paper shows that the pushback against AI tools from content creators and website owners who object to their work being used for AI training purposes is not only real, but is becoming a major issue for AI companies that rely on publicly available online content as training data. The authors suggest that this will be particularly problematic for smaller companies and research projects that do not have the resources to license such content.

In Europe, where research uses are allowed under a mandatory exception to copyright that cannot be overridden by contract or technological measures, the negative impact on researchers is likely to be more limited than the authors fear. However, there are many other beneficial uses, such as search or web archiving, that will be affected by blanket restrictions via robots.txt and other means. In this context, the authors point to the need for better protocols, which is very much in line with our arguments for standardized rights holder opt-outs. From the concluding section of the paper:

The web needs better protocols to communicate intentions and consent. The [Robots Exclusions Protocol] places an immense burden on website owners to correctly anticipate all agents who may crawl their domain for undesired downstream use cases. We consistently find this leads to protocol implementations that don’t reflect intended consent. An alternative scheme might give website owners control over how their webpages are used rather than who can use them. This would involve standardizing a taxonomy that better represents downstream use cases, e.g. allowing domain owners to specify that web crawling only be used for search engines, or only for non-commercial AI, or only for AI that attributes outputs to their source data. New commands could also set extended restriction periods given dynamic sites may want to block crawlers for extended periods of time, e.g. for journalists to protect their data freshness. Ultimately, a new protocol should lead to website owners having greater capacity to self-sort consensual from non-consensual uses, implementing machine-readable instructions that approximate the natural language instructions in their Terms of Service.
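To make the contrast between controlling who crawls and controlling how content is used more concrete, here is a minimal sketch of a use-based preference check. The category names and the `SitePreferences` structure are hypothetical and invented for illustration; they loosely follow the example use cases named in the quote and do not correspond to any existing protocol or to a scheme specified in the paper.

```python
from dataclasses import dataclass, field

# Hypothetical use categories, loosely based on the examples in the quote.
SEARCH = "search"
NONCOMMERCIAL_AI = "non-commercial-ai"
COMMERCIAL_AI = "commercial-ai"


@dataclass
class SitePreferences:
    """Uses a site owner permits, independent of which crawler is asking."""

    allowed_uses: set[str] = field(default_factory=set)

    def permits(self, declared_use: str) -> bool:
        # A use is permitted only if the owner listed it explicitly.
        return declared_use in self.allowed_uses


# A site that accepts search indexing and non-commercial AI, nothing else.
prefs = SitePreferences(allowed_uses={SEARCH, NONCOMMERCIAL_AI})
print(prefs.permits(SEARCH))
print(prefs.permits(COMMERCIAL_AI))
```

Under this assumed design the owner never has to enumerate crawler names; any agent, known or unknown, is judged by the use it declares.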

Both the New York Times and 404 media have published articles that go into more detail on the paper.

May 16, 2024

Considerations for implementing rightholder opt-outs by AI model developers

This policy brief explores what compliance policies for Article 53(1)(c) of the AI Act could look like in practice and what technical standards and services are available to implement the rightholder opt-outs.

March 22, 2024

France fines Google for unauthorized use of press publications to train AI

On Wednesday, the French competition authority fined Google 250 million euros for failing to inform news publishers about the use of their content to train a generative AI system. Technically, the fine is a penalty for failing to comply with commitments Google had previously made to news publishers in a 2021 settlement that is being monitored by the competition authority. What makes the decision noteworthy is that it is the first documented case in which the use of copyrighted works to train AI systems has been challenged under the EU's copyright framework. While there are numerous pending cases pitting creators and other rights holders against generative AI companies, almost all of them have been filed outside the EU. As we have pointed out before, the main shortcoming of the EU approach to regulating the use of copyrighted works for AI training is the lack of standardized ways for creators and other rightholders to opt out. This aspect also played a key role in the competition authority's decision, which states:

Furthermore, until at least September 28, 2023 and the launch of its "Google Extended" tool, Google did not offer a technical solution enabling publishers and press agencies to oppose the use of their content by Bard without affecting the display of this content on other Google services. Indeed, until now, publishers and press agencies wishing to oppose such use had to insert an instruction opposing any indexing of their content by Google, including on the Search, Discover and Google News services, which were precisely the subject of negotiation for the remuneration of neighboring rights. In the future, the Autorité will be particularly attentive to the effectiveness of the opt-out mechanisms put in place by Google.

This once more underlines the urgent need to implement standards for rightsholder opt-outs that are efficient, flexible and scalable.

February 29, 2024

Poland gets creative with Text and Data Mining

Last week, the Polish government proposed its much-delayed implementation of the Copyright Directive. The implementation proposal contained a big surprise: the Polish government is proposing to add language to the text and data mining provisions that asserts that “reproduction of works for text and data mining cannot be used to create generative AI models.” Paul has published an analysis of the proposal on the Kluwer Copyright Blog in which he argues that such a limitation is not only non-compliant with the provisions of the CDSM Directive, but is also based on flawed assumptions and would result in a legal mess:

At this point, it seems useful to recall the key balances inherent in the EU’s regulatory framework for the use of copyrighted works in AI training. They form the basis of claims by the Commission and others that the EU has a uniquely balanced approach to this thorny issue. Taken together, the TDM provisions address 4 key concerns: (1) They limit permission to use copyrighted works for training data to those works that are lawfully accessible. They (2) privilege non-profit scientific research, (3) they ensure that creators and other rights holders can exclude their works from being used to train generative AI systems, and (4) they ensure that works that are not actively managed by their rights holders can be used to train AI models. Excluding the training of generative AI from this balanced arrangement may please some creators and rights holders, but it also pushes AI back into a legal gray area. It also seems incompatible with the provisions of the AI Act, which situates the training of generative AI models within the broader concept of TDM, and which will be directly applicable in Poland.

Expanding on this analysis, we have also submitted a contribution (PL|EN) to the public consultation launched by the Polish Ministry of Culture and National Heritage that argues for an implementation in line with the directive and suggests that Polish lawmakers should instead focus on enabling a fair remuneration for creators who opt out of TDM and ensuring the sustainability of public information resources.

February 1, 2024

Alignment Assembly on AI and the Commons

Open Future is hosting an asynchronous, virtual alignment assembly for the open movement to explore principles and considerations for regulating generative AI. We hope to reach 500 participants, spread across different fields of open and coming from different regions of the world.

January 10, 2024

In a recent article titled 'Generative AI Has a Visual Plagiarism Problem', Gary Marcus and Reid Southen provide further evidence of the ability of generative AI models to reproduce remarkably similar versions of works in their training data. They show that, in response to generic prompts, the latest versions of Midjourney and DALL-E return images that closely resemble frames from popular movies and/or contain copyrighted characters. This discovery raises a number of interesting questions about the ability of these models to infringe copyright, seemingly on their own. The article is also notable for a quote from David Holz, founder and CEO of Midjourney, in response to a question about whether Midjourney seeks permission from copyright holders. His answer:

No. There isn’t really a way to get a hundred million images and know where they’re coming from. It would be cool if images had metadata embedded in them about the copyright owner or something. But that’s not a thing; there’s not a registry. There’s no way to find a picture on the Internet, and then automatically trace it to an owner and then have any way of doing anything to authenticate it.

While this response sounds derisive in the context of the article (a similar statement made by OpenAI to the House of Lords was also criticized as derisive), Holz does have a point. There is indeed an urgent need for better copyright information infrastructures that allow AI model developers and others to automatically assess the copyright status of works and to clear rights, something we pointed out in our recent policy paper on best practices for opting out of ML training and in an earlier white paper on a public repository of public domain and openly licensed works.

December 14, 2023

A first look at the copyright relevant parts in the final AI Act compromise

Representatives of the European Parliament, EU member states, and the European Commission reached a provisional agreement on the proposed AI Act. The copyright provisions in the AI Act are a step in the right direction. They further consolidate the existing balanced legislative approach adopted by the EU in the 2019 CDSM Directive.

December 7, 2023

event

Creativity, Ownership and Public Value in the Age of AI.

We participated in the Shaping Europe’s digital model conference organized by the Socialists and Democrats Group in the European Parliament. Paul led a panel on Creativity, Ownership, and Public Value in the Age of AI.

November 29, 2023

AI and copyright: Convergence of opt-outs?

The blog post argues that with increasing convergence on creator/rightholder opt-outs as an essential mechanism in the governance of generative AI models, there is an urgent need for standardization of machine-readable opt-outs.

November 15, 2023

Friction in AI Governance: there’s more to it than breaking servers

In this article, Nadia Nadesan examines collective bargaining as an essential element of AI governance.

October 27, 2023

Blender Talk: AI, the commons and the limits of copyright

At this year's Blender Conference, Paul gave a talk on AI, the commons, and the limits of copyright. The talk rehashes some of the arguments made in an earlier blog post with the same title and combines them with the seven recommendations for making AI work for creators and the commons that we developed with other participants of this year's Creative Commons Summit. A recording of Paul's talk is available on the Blender YouTube channel.

October 9, 2023

CC community statement on “Making AI work for Creators and the Commons”

Ahead of this year's Creative Commons Summit in Mexico City, Open Future and Creative Commons hosted a one-day workshop to discuss the impact of generative AI on creators and the commons. The workshop explored how legal and regulatory contexts differ around the world and how this affects the development of shared strategies for dealing with the impact of generative AI on the commons and the position of creators.

Based on this discussion, and in subsequent conversations over the three days of the summit, the group identified a set of seven principles that could guide further work on creating an equitable framework for the regulation of generative AI around the world. These principles form part of a statement on "Making AI work for Creators and the Commons", which was published on the Creative Commons blog on the final day of the Summit.

September 28, 2023

Defining best practices for opting out of ML training

This Open Future policy brief examines the technical implementation of the EU law provision allowing authors and other rightholders to opt out of having their works used as training data for (generative) machine learning (ML) systems.

August 7, 2023

Opting-out of ML training – one model at a time?

Today, OpenAI announced that GPTBot, the web crawler used to collect training data for its GPT series of large language models, can now be blocked via the robots.txt protocol. Site administrators can either disallow crawling of entire sites or create custom rules that allow `GPTBot` access to some parts of a site while blocking it from others. This functionality gives site owners a level of control over how their content is used by OpenAI's LLMs that they previously lacked.
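As a rough illustration of such custom rules, the sketch below evaluates a hypothetical robots.txt with Python's standard `urllib.robotparser`. The paths `/blog/` and `/archive/` and the domain `example.org` are assumptions made for the example; only the `GPTBot` user-agent token comes from the announcement.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot may read /blog/ but nothing under /archive/.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /archive/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.org/blog/post"))
print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.org/archive/2020"))
```

Replacing the two rules with a single `Disallow: /` line would correspond to the other option, blocking the crawler from the entire site.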

At first glance, OpenAI's approach follows the opt-out mechanism established by the TDM exceptions in the EU copyright framework. But on closer inspection, the model/vendor-specific nature of this approach raises more questions than it answers, as it implies that it is the responsibility of website publishers to set rules for each individual ML training crawler operating on the web, rather than setting default permissions that apply to all ML training crawlers.
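The gap left by vendor-specific rules can be shown with a small sketch, again using Python's standard `urllib.robotparser`. The crawler names `HypotheticalMLBot` and `HypotheticalSearchBot` and the URL are invented for illustration; the point is only that a rule naming one crawler says nothing about the others, while a blanket rule also shuts out crawlers the publisher may want to keep.

```python
from urllib.robotparser import RobotFileParser

# Rule that targets a single vendor's crawler.
VENDOR_SPECIFIC = """\
User-agent: GPTBot
Disallow: /
"""

# Blanket rule that applies to every crawler.
BLANKET = """\
User-agent: *
Disallow: /
"""

# GPTBot is real; the other two names are hypothetical stand-ins.
AGENTS = ["GPTBot", "HypotheticalMLBot", "HypotheticalSearchBot"]


def blocked_agents(robots_txt, agents, url="https://example.org/article"):
    """List the agents that the given robots.txt prevents from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


print(blocked_agents(VENDOR_SPECIFIC, AGENTS))  # only the named crawler
print(blocked_agents(BLANKET, AGENTS))          # every crawler, search included
```

Neither configuration expresses "no ML training, but search is fine", which is the kind of default permission the paragraph above finds missing.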

June 22, 2023

AI, the Commons, and the limits of copyright

There has been a lot of attention on copyright and generative AI/ML over the last few months. In this essay, I propose a two-fold strategy to tackle this situation. First, it is essential to guarantee that individual creators can opt out of having their works used in AI training. Second, we should implement a levy that redirects a portion of the surplus from training AI on humanity's collective creativity back to the commons.

March 8, 2023

Spawning.ai announces that it has collected opt-out requests for 80 million artworks

According to the announcement, 40,000+ individual artworks have been opted out from use for ML training via the haveibeentrained.com tool. The remaining 79 million+ opt-outs were registered through partnerships with platforms (such as ArtStation) and large rightholders (such as Shutterstock).

These opt-outs are for images included in the LAION-5B dataset used to train the Stable Diffusion text-to-image model. Stability AI has announced that the opt-outs collected by spawning.ai and made available via an API will be respected in the upcoming training of Stable Diffusion V3.

As we have previously argued, such opt-outs are supported by the EU's legal framework for machine learning, which allows rights holders to reserve the right to text and data mining carried out for all purposes except academic research undertaken by academic research institutions. Spawning.ai is the first large-scale initiative to leverage this framework to offer creators and other rights holders the ability to exclude their works from being used for machine learning training.

February 17, 2023