AI and Creative Labor

October 10, 2024

The Landgericht Hamburg's decision to allow LAION to include a photographer's image in the LAION-5B training dataset empowers non-profit providers of public training datasets, which play a critical role in making AI training more transparent.

July 24, 2024

Analysis

Machine readable or not?

Observations on the hearing of the first court case dealing with the use of copyrighted works as AI training data in the context of the European Union's copyright framework.

July 22, 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons

Last week, the Data Provenance Initiative at MIT released a new paper by Shayne Longpre et al. that documents a dramatic increase in restrictions on the use of publicly available content as AI training data. It is the first large-scale longitudinal study of the restrictions placed on online content via robots.txt and terms of service, and it covers the web sources behind a number of commonly used AI training datasets. From virtually no restrictions on use as training data just a year ago, the researchers found that more than 28% of the most actively maintained, critical sources for the C4 training dataset are now completely restricted from use via robots.txt.

The paper documents a sharp increase in such restrictions starting in the fall of 2023, which coincides with the time when OpenAI, Google, and others began documenting the ability to use robots.txt to block their crawlers from ingesting publicly available content. The paper shows that the pushback against AI tools from content creators and website owners who object to their work being used for AI training purposes is not only real, but is becoming a major issue for AI companies that rely on publicly available online content as training data. The authors suggest that this will be particularly problematic for smaller companies and research projects that do not have the resources to license such content.

In Europe, where research uses are allowed under a mandatory exception to copyright that cannot be overridden by contract or technological measures, the negative impact on researchers is likely to be more limited than the authors fear. However, there are many other beneficial uses, such as search or web archiving, that will be affected by blanket restrictions via robots.txt and other means. In this context, the authors point to the need for better protocols, which is very much in line with our arguments for standardized rights holder opt-outs. From the concluding section of the paper:

The web needs better protocols to communicate intentions and consent. The [Robots Exclusions Protocol] places an immense burden on website owners to correctly anticipate all agents who may crawl their domain for undesired downstream use cases. We consistently find this leads to protocol implementations that don’t reflect intended consent. An alternative scheme might give website owners control over how their webpages are used rather than who can use them. This would involve standardizing a taxonomy that better represents downstream use cases, e.g. allowing domain owners to specify that web crawling only be used for search engines, or only for non-commercial AI, or only for AI that attributes outputs to their source data. New commands could also set extended restriction periods given dynamic sites may want to block crawlers for extended periods of time, e.g. for journalists to protect their data freshness. Ultimately, a new protocol should lead to website owners having greater capacity to self-sort consensual from non-consensual uses, implementing machine-readable instructions that approximate the natural language instructions in their Terms of Service.
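To make the difference between controlling who crawls and controlling how content is used more concrete, here is a minimal Python sketch. It is purely illustrative: the `Allow-use` directive, the category names, and the `UsePolicy` and `parse_policy` helpers are hypothetical and not part of robots.txt or of any existing or proposed standard. The categories are simply taken from the examples in the quote above.

```python
from dataclasses import dataclass

# Hypothetical use categories, borrowed from the examples in the quoted passage.
KNOWN_USES = frozenset(
    {"search", "non-commercial-ai", "attributed-ai", "commercial-ai"}
)


@dataclass(frozen=True)
class UsePolicy:
    """A site's declaration of which downstream uses it consents to."""

    allowed_uses: frozenset = frozenset()

    def permits(self, declared_use: str) -> bool:
        # Anything not explicitly allowed is treated as not consented to.
        return declared_use in self.allowed_uses


def parse_policy(text: str) -> UsePolicy:
    """Parse lines such as 'Allow-use: search' (hypothetical syntax)."""
    allowed = set()
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        key, _, value = line.partition(":")
        use = value.strip().lower()
        if key.strip().lower() == "allow-use" and use in KNOWN_USES:
            allowed.add(use)
    return UsePolicy(frozenset(allowed))


policy = parse_policy(
    """
    # Declared once, for every crawler, whatever it calls itself
    Allow-use: search
    Allow-use: non-commercial-ai
    """
)
print(policy.permits("search"))         # search indexing is allowed
print(policy.permits("commercial-ai"))  # commercial AI training is not
```

The point of the sketch is that the site owner states a purpose once, instead of having to anticipate the name of every crawler that might show up.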

Both the New York Times and 404 Media have published articles that go into more detail on the paper.

Read the paper

May 16, 2024

publication

Considerations for implementing rightholder opt-outs by AI model developers

This policy brief explores what compliance policies for Article 53(1)(c) of the AI Act could look like in practice and what technical standards and services are available to implement the rightholder opt-outs.

March 22, 2024

France fines Google for unauthorized use of press publications to train AI

On Wednesday, the French competition authority fined Google 250 million euros for failing to inform news publishers about the use of their content to train a generative AI system. Technically, the fine is a penalty for failing to comply with commitments Google had previously made to news publishers in a 2021 settlement that is being monitored by the competition authority. What makes the decision noteworthy is that it is the first documented case in which the use of copyrighted works to train AI systems has been challenged under the EU's copyright framework. While there are numerous pending cases pitting creators and other rights holders against generative AI companies, almost all of them have been filed outside the EU. As we have pointed out before, the main shortcoming of the EU approach to regulating the use of copyrighted works for AI training is the lack of standardized ways for creators and other rightholders to opt out. This aspect also played a key role in the competition authority's decision, which states:

Furthermore, until at least September 28, 2023 and the launch of its "Google Extended" tool, Google did not offer a technical solution enabling publishers and press agencies to oppose the use of their content by Bard without affecting the display of this content on other Google services. Indeed, until now, publishers and press agencies wishing to oppose such use had to insert an instruction opposing any indexing of their content by Google, including on the Search, Discover and Google News services, which were precisely the subject of negotiation for the remuneration of neighboring rights. In the future, the Autorité will be particularly attentive to the effectiveness of the opt-out mechanisms put in place by Google.

This once more underlines the urgent need to implement standards for rightsholder opt-outs that are efficient, flexible and scalable.
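The difference the Autorité describes can be illustrated with two robots.txt variants. The sketch below uses Python's standard `urllib.robotparser`; the two files, the `allowed` helper, and the `news.example` URL are made up for illustration, and the parser only shows what the rules say, not how Google's systems actually behave.

```python
from urllib.robotparser import RobotFileParser

# Before Google-Extended: the only lever was blocking Google's crawler
# outright, which also removes the content from Search, Discover and News.
BLOCK_EVERYTHING = """\
User-agent: Googlebot
Disallow: /
"""

# With Google-Extended: oppose AI use while remaining indexable.
BLOCK_AI_ONLY = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""


def allowed(robots_txt: str, token: str,
            url: str = "https://news.example/article") -> bool:
    """Return True if the given robots.txt lets `token` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


print(allowed(BLOCK_EVERYTHING, "Googlebot"))     # indexing is blocked too
print(allowed(BLOCK_AI_ONLY, "Googlebot"))        # indexing stays possible
print(allowed(BLOCK_AI_ONLY, "Google-Extended"))  # AI use is opted out
```

Even in the second variant the opt-out only addresses one vendor, which is why a standardized mechanism matters.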

February 29, 2024

Poland gets creative with Text and Data Mining

Last week, the Polish government proposed its much-delayed implementation of the Copyright Directive. The implementation proposal contained a big surprise: the Polish government is proposing to add language to the text and data mining provisions asserting that “reproduction of works for text and data mining cannot be used to create generative AI models.” Paul has published an analysis of the proposal on the Kluwer Copyright Blog in which he argues that such a limitation is not only non-compliant with the provisions of the CDSM Directive, but is also based on flawed assumptions and would result in a legal mess:

At this point, it seems useful to recall the key balances inherent in the EU’s regulatory framework for the use of copyrighted works in AI training. They form the basis of claims by the Commission and others that the EU has a uniquely balanced approach to this thorny issue. Taken together, the TDM provisions address 4 key concerns: (1) They limit permission to use copyrighted works for training data to those works that are lawfully accessible. They (2) privilege non-profit scientific research, (3) they ensure that creators and other rights holders can exclude their works from being used to train generative AI systems, and (4) they ensure that works that are not actively managed by their rights holders can be used to train AI models. Excluding the training of generative AI from this balanced arrangement may please some creators and rights holders, but it also pushes AI back into a legal gray area. It also seems incompatible with the provisions of the AI Act, which situates the training of generative AI models within the broader concept of TDM, and which will be directly applicable in Poland.

Expanding on this analysis, we have also submitted a contribution (PL|EN) to the public consultation launched by the Polish Ministry of Culture and National Heritage that argues for an implementation in line with the directive and suggests that Polish lawmakers should instead focus on enabling a fair remuneration for creators who opt out of TDM and ensuring the sustainability of public information resources.

February 1, 2024

blog

Alignment Assembly on AI and the Commons

Open Future is hosting an asynchronous, virtual alignment assembly for the open movement to explore principles and considerations for regulating generative AI. We hope to reach 500 participants, spread across different fields of open and coming from different regions of the world.

January 10, 2024

In a recent article titled 'Generative AI Has a Visual Plagiarism Problem', Gary Marcus and Reid Southen provide further evidence of the ability of generative AI models to reproduce remarkably similar versions of works in their training data. They show that, in response to generic prompts, the latest versions of Midjourney and DALL-E return images that closely resemble frames from popular movies and/or contain copyrighted characters. This discovery raises a number of interesting questions about the ability of these models to infringe copyright, seemingly on their own. The article is also notable for a quote from David Holz, founder and CEO of Midjourney, in response to a question about whether Midjourney seeks permission from copyright holders. His answer:

No. There isn’t really a way to get a hundred million images and know where they’re coming from. It would be cool if images had metadata embedded in them about the copyright owner or something. But that’s not a thing; there’s not a registry. There’s no way to find a picture on the Internet, and then automatically trace it to an owner and then have any way of doing anything to authenticate it.

While this response sounds derisive in the context of the article (a similar statement made by OpenAI to the House of Lords was also criticized as derisive), Holz does have a point. There is indeed an urgent need for better copyright information infrastructures that allow AI model developers and others to automatically assess the copyright status of works and to clear rights. This is something we pointed out in our recent policy paper on best practices for opting out of ML training and in an earlier white paper on a public repository of public domain and openly licensed works.

December 14, 2023

Opinion

A first look at the copyright relevant parts in the final AI Act compromise

Representatives of the European Parliament, EU member states, and the European Commission reached a provisional agreement on the proposed AI Act. The copyright provisions in the AI Act are a step in the right direction. They further consolidate the existing balanced legislative approach adopted by the EU in the 2019 CDSM Directive.

December 7, 2023

event

Creativity, Ownership and Public Value in the Age of AI.

We participated in the Shaping Europe’s digital model conference organized by the Socialists and Democrats Group in the European Parliament. Paul led a panel on Creativity, Ownership, and Public Value in the Age of AI.

November 29, 2023

blog

AI and copyright: Convergence of opt-outs?

The blog post argues that with increasing convergence on creator/rightholder opt-outs as an essential mechanism in the governance of generative AI models, there is an urgent need for standardization of machine-readable opt-outs.

November 15, 2023

Opinion

Friction in AI Governance: there’s more to it than breaking servers

In this article, Nadia Nadesan examines collective bargaining as an essential element of AI governance.

October 27, 2023

Blender Talk: AI, the commons and the limits of copyright

At this year's Blender Conference, Paul gave a talk on AI, the commons, and the limits of copyright. The talk revisits some of the arguments made in an earlier blog post with the same title, and combines them with the seven recommendations for making AI work for creators and the commons that we developed with other participants of this year's Creative Commons Summit. A recording of Paul's talk is available on the Blender YouTube channel.

October 9, 2023

CC community statement on “Making AI work for Creators and the Commons”

Ahead of this year's Creative Commons Summit in Mexico City, Open Future and Creative Commons hosted a one-day workshop to discuss the impact of generative AI on creators and the commons. The workshop explored how legal and regulatory contexts differ around the world and how this affects the development of shared strategies for dealing with the impact of generative AI on the commons and the position of creators.

Based on this discussion, and in subsequent conversations over the three days of the summit, the group identified a set of seven principles that could guide further work on creating an equitable framework for the regulation of generative AI around the world. These principles form part of a statement on "Making AI work for Creators and the Commons", which was published on the Creative Commons blog on the final day of the Summit.

Read the statement

September 28, 2023

publication

Defining best practices for opting out of ML training

This Open Future policy brief examines the technical implementation of the EU law provision allowing authors and other rightholders to opt out of having their works used as training data for (generative) machine learning (ML) systems.

August 7, 2023

Opting-out of ML training – one model at a time?

Today, OpenAI announced that GPTBot, the web crawler used to collect training data for its GPT series of large language models, can now be blocked via the robots.txt protocol. Site administrators can either disallow crawling of entire sites or create custom rules that allow `GPTBot` access to some parts of a site while blocking it from others. This functionality gives site owners a level of control over how their content is used by OpenAI's LLMs that they previously lacked.

At first glance, OpenAI's approach follows the opt-out mechanism established by the TDM exceptions in the EU copyright framework. But on closer inspection, the model/vendor-specific nature of this approach raises more questions than it answers, as it implies that it is the responsibility of website publishers to set rules for each individual ML training crawler operating on the web, rather than setting default permissions that apply to all ML training crawlers.
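A small sketch shows both the custom rules and the limitation. It uses Python's standard `urllib.robotparser`; the robots.txt content, the `example.org` URLs, and the `OtherMLBot` crawler name are invented for illustration and do not refer to any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: GPTBot may read /blog/ but nothing else.
# Allow is listed before Disallow so that first-match parsers (such as
# Python's) and longest-match parsers reach the same result.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(user_agent: str, url: str) -> bool:
    """Return True if ROBOTS_TXT lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_crawl("GPTBot", "https://example.org/blog/post"))      # permitted
print(can_crawl("GPTBot", "https://example.org/photos/1"))       # blocked
# A hypothetical ML crawler that the site owner has not heard of yet
# falls through to the catch-all rule and is not blocked at all.
print(can_crawl("OtherMLBot", "https://example.org/photos/1"))
```

The last call is the crux: unless the publisher names every ML training crawler individually, the rules written for `GPTBot` say nothing about the others.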

June 22, 2023

Opinion

AI, the Commons, and the limits of copyright

There has been a lot of attention on copyright and generative AI/ML over the last few months. In this essay, I propose a two-fold strategy to tackle this situation. First, it is essential to guarantee that individual creators can opt out of having their works used in AI training. Second, we should implement a levy that redirects a portion of the surplus from training AI on humanity's collective creativity back to the commons.

March 8, 2023

Spawning.ai announces it has collected opt-out requests for 80 million artworks

According to the announcement, 40,000+ individual artworks have been opted out from use for ML training via the haveibeentrained.com tool. The remaining 79 million+ opt-outs were registered through partnerships with platforms (such as ArtStation) and large rightholders (such as Shutterstock).

These opt-outs are for images included in the LAION-5B dataset used to train the Stable Diffusion text-to-image model. Stability AI has announced that the opt-outs collected by Spawning.ai and made available via an API will be respected in the upcoming training of Stable Diffusion V3.

As we have previously argued, such opt-outs are supported by the EU's legal framework for machine learning, which allows rights holders to reserve the right to text and data mining carried out for all purposes except academic research undertaken by academic research institutions. Spawning.ai is the first large-scale initiative to leverage this framework to offer creators and other rights holders the ability to exclude their works from being used for machine learning training.

February 17, 2023

Opinion

Protecting Creatives or Impeding Progress?

As generative machine learning (ML) becomes more widespread, the issue of copyright and ML input is back in focus. This post explores the EU legal framework governing the use of copyrighted works for training ML systems and the potential for collective action by artists and creators.

Making generative AI work for creators and the commons

Timeline