In a recent article titled ‘Generative AI Has a Visual Plagiarism Problem’, Gary Marcus and Reid Southen provide further evidence of the ability of generative AI models to reproduce remarkably similar versions of works in their training data. They show that, in response to generic prompts, the latest versions of Midjourney and DALL-E return images that closely resemble frames from popular movies or contain copyrighted characters. This discovery raises a number of interesting questions about the ability of these models to infringe copyright – seemingly on their own.
The article is also notable for a quote from David Holz, founder and CEO of Midjourney, in response to a question about whether Midjourney seeks permission from copyright holders. His answer:
No. There isn’t really a way to get a hundred million images and know where they’re coming from. It would be cool if images had metadata embedded in them about the copyright owner or something. But that’s not a thing; there’s not a registry. There’s no way to find a picture on the Internet, and then automatically trace it to an owner and then have any way of doing anything to authenticate it.
While this response sounds dismissive in the context of the article (a similar statement made by OpenAI to the House of Lords was criticized on the same grounds), Holz does have a point. There is indeed an urgent need for better copyright information infrastructure that allows AI model developers and others to automatically assess the copyright status of works – and to clear rights. This is something we pointed out in our recent policy paper on best practices for opting out of ML training and in an earlier white paper on a public repository of public domain and openly licensed works.
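Holz's point about missing metadata can be made concrete. Image formats do define places for rights information – for example, the PNG specification reserves a `tEXt` chunk with the standard keyword `Copyright` – but nothing requires publishers to fill it in, and nothing authenticates what is there. The sketch below (standard library only; the embedded rights string is invented for illustration) builds a minimal 1×1 PNG carrying such a chunk and reads it back, showing both how easy embedding would be and how little it proves on its own:

```python
# Sketch: PNG images can carry a "Copyright" tEXt chunk, but it is
# optional and unauthenticated -- exactly the gap Holz describes.
import struct
import zlib


def make_chunk(ctype: bytes, body: bytes) -> bytes:
    """Assemble a PNG chunk: length, type, data, CRC-32 of type+data."""
    return (struct.pack(">I", len(body)) + ctype + body
            + struct.pack(">I", zlib.crc32(ctype + body)))


def png_text_chunks(data: bytes) -> dict:
    """Parse all tEXt chunks of a PNG byte string into {keyword: value}."""
    assert data[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG file"
    out, pos = {}, 8
    while pos < len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        if ctype == b"tEXt":
            keyword, _, value = data[pos + 8:pos + 8 + length].partition(b"\x00")
            out[keyword.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return out


# Minimal 1x1 grayscale PNG with a (hypothetical) rights statement embedded.
ihdr = make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
text = make_chunk(b"tEXt", b"Copyright\x00CC BY 4.0 Example Author")
idat = make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))  # filter byte + 1 pixel
png = b"\x89PNG\r\n\x1a\n" + ihdr + text + idat + make_chunk(b"IEND", b"")

print(png_text_chunks(png))  # {'Copyright': 'CC BY 4.0 Example Author'}
```

Of course, a self-declared text string is a long way from the registry-backed, verifiable rights infrastructure the quote calls for: anyone can write any value into the chunk, and most images in a hundred-million-item training set will simply lack it.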