This line of our work explores the consequences of the fact that machines can now consume human creativity, reassemble it, and spit out synthetic content that closely resembles the creative output previously produced by human creators.
The arrival of powerful generative machine learning models in 2022 raised important questions about their impact on creators and other rightholders. Will generative AI systems replace human creators? How will they affect the income of creators and other cultural producers? Do AI companies have the right to use copyright-protected works as training data for their models, and if so, under what conditions? And what does the emergence of generative AI tell us about the limits of copyright?
Our work in this area is guided by the objective of making AI work for both creators and the Digital Commons.
No. There isn’t really a way to get a hundred million images and know where they’re coming from. It would be cool if images had metadata embedded in them about the copyright owner or something. But that’s not a thing; there’s not a registry. There’s no way to find a picture on the Internet, and then automatically trace it to an owner and then have any way of doing anything to authenticate it.
While this response sounds derisive in the context of the article (a similar statement made by OpenAI to the House of Lords was also criticized as derisive), Holz does have a point. There is indeed an urgent need for better copyright information infrastructures that allow AI model developers and others to automatically assess the copyright status of works and to clear rights. We pointed this out in our recent policy paper on best practices for opting out of ML training and in an earlier white paper on a public repository of public domain and openly licensed works.
GPTBot, the web crawler that OpenAI uses to collect training data for its GPT series of large language models, can now be blocked via the robots.txt protocol. Site administrators can either disallow crawling of entire sites or create custom rules that allow `GPTBot` access to some parts of a site while blocking it from others. This functionality gives site owners a level of control over how their content is used by OpenAI's LLMs that they previously lacked.
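As a rough illustration of such custom rules, the following Python sketch parses a robots.txt file with the standard library's `urllib.robotparser` and checks which URLs `GPTBot` may fetch. The site `example.org`, the paths `/blog/` and `/archive/`, and the helper name `gptbot_may_fetch` are assumptions made for this example. The `Allow` line is placed before the `Disallow` line because this parser applies the first matching rule; other crawlers may resolve overlapping rules differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot may crawl /blog/ but nothing else.
# Python's parser uses the first matching rule, so Allow comes first.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /
"""


def gptbot_may_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt lets GPTBot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    for url in (
        "https://example.org/blog/post-1",
        "https://example.org/archive/photo.jpg",
    ):
        print(url, gptbot_may_fetch(ROBOTS_TXT, url))
```

Under these assumed rules the blog post is reported as fetchable and the archive image is not. Whether a crawler actually honors the file is up to its operator, since robots.txt is a voluntary convention.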
At first glance, OpenAI's approach follows the opt-out mechanism established by the text and data mining (TDM) exceptions in the EU copyright framework. On closer inspection, however, the model- and vendor-specific nature of this approach raises more questions than it answers. It implies that website publishers are responsible for setting rules for each individual ML training crawler operating on the web, rather than being able to set default permissions that apply to all ML training crawlers.
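The gap left by vendor-specific rules can be sketched in a few lines of Python using the standard library's `urllib.robotparser`. The crawler name `ExampleMLBot` is hypothetical and stands in for any other ML training crawler; the URL and the helper name `crawl_permissions` are likewise assumptions for this illustration.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that opts out of GPTBot only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def crawl_permissions(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user agent to whether the robots.txt lets it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # "ExampleMLBot" is a made-up name for some other training crawler.
    result = crawl_permissions(
        ROBOTS_TXT,
        ["GPTBot", "ExampleMLBot"],
        "https://example.org/essay.html",
    )
    print(result)
```

In this sketch only `GPTBot` is blocked. The hypothetical second crawler remains allowed until the publisher learns its user agent string and adds a separate rule for it. A `User-agent: *` rule would cover it, but it would also block crawlers that have nothing to do with ML training, such as search engine indexers.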