There has been a lot of attention on copyright and generative AI/ML over the last few months. Outside the circles of copyright lawyers, the main point of interest is to what extent AI systems capable of producing synthetic media should be allowed to use copyrighted works as part of their training data. Creators are alarmed about what many see as a regulatory failure in the face of yet another assault by tech companies on the business models they have come to depend on. Or, as an open letter published last month frames it:
AI art generators are trained on enormous datasets, containing millions upon millions of copyrighted images, harvested without their creator’s knowledge, let alone compensation or consent. This is effectively the greatest art heist in history. Perpetrated by respectable-seeming corporate entities backed by Silicon Valley venture capital. It’s daylight robbery.
The above quote deals specifically with AI art generators, such as Stable Diffusion or DALL-E. However, the same argument could apply to other generative machine learning models that create text, music, or other forms of synthetic content. In this essay, I propose a twofold strategy for dealing with this situation. First, it is essential to guarantee that individual creators can opt out of having their works used in AI training. Second, we should implement a levy that redirects a portion of the surplus from training AI on humanity’s collective creativity back to the commons.
In essence, the question here is how we should deal with the fact that machines can now consume human creativity, reassemble it, and spit out synthetic content that very much resembles the creative output previously produced by human creators. That this is possible, and that the technology underpinning these systems will continue to improve, is at this stage a technological inevitability. At the same time, many observers feel (and this is at the heart of much of the discussion) that this is a situation that copyright should be able to regulate.
But therein lies the problem: copyright, as we know it today, is not well equipped for this task and is clearly reaching its conceptual limits when it comes to generative ML systems. This is mainly because, counter-intuitively, the output of generative ML systems does not consist of copies (or adaptations) of the works on which they have been trained. Copyright as we know it was conceived long before anyone could imagine machines that could consume (for lack of a better word) several billion copyrighted works and use the information gained as a basis for spitting out new things that contain tiny bits of information gleaned from the works they were trained on[1].
The only point in this process where actual copies are made is during training: the works that make up the training data are temporarily copied onto the computer systems that run the training. The extent to which these copies infringe the rights of copyright owners varies significantly from jurisdiction to jurisdiction. In countries such as Japan, South Korea, and Singapore, these copies do not require permission from the rightsholders under so-called text and data mining exceptions to copyright, which allow temporary copies to be made for the purpose of computational analysis.
In the US, it is still unclear whether such copying infringes copyright or should be considered ‘fair use’. The theory that ML training constitutes fair use and therefore does not require permission from rightsholders is currently being challenged in the courts.
Neither of these approaches gives creators any say over whether and how their works can be used as training data for generative ML systems.
Meanwhile, the EU has put forward a somewhat more nuanced approach that provides creators and other rightsholders with some agency over how their works can be used.
Under the EU copyright rules, making copies to train ML models is allowed in the context of scientific research. Crucially, it is also allowed for all other purposes unless the creators (or rightsholders) indicate that they don’t want their works to be used for ML training[2].
This gives those creators who care about such things considerable control over how their works are used. They can decide that their works should not be used to train generative models (because they object to this kind of use or because they prefer to build their own personal ML models, which they can then license), or they can license the use of their works for ML training, either individually or collectively. In other words, as long as they reserve their rights (via machine-readable opt-outs), they can rely on all the traditional tools provided by copyright.
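The EU rules require that such a reservation be machine-readable, but they do not prescribe a specific format, and conventions are still settling in practice. As a rough sketch of the mechanics, a training-data crawler that wants to respect opt-outs might check two of the signals currently in circulation: a robots.txt disallow rule and the “tdm-reservation” header from the draft W3C TDM Reservation Protocol (TDMRep). Both are emerging conventions rather than formats mandated by the directive, and “ExampleMLBot” is a hypothetical crawler name used for illustration:

```python
# Minimal sketch (not a production crawler): check two emerging
# machine-readable opt-out signals before using a page for ML training.
import urllib.parse
import urllib.request
import urllib.robotparser

CRAWLER_USER_AGENT = "ExampleMLBot"  # hypothetical crawler name


def may_use_for_training(url: str) -> bool:
    """Return False if the site signals a machine-readable opt-out."""
    # 1. robots.txt: has the site disallowed this crawler?
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urllib.parse.urljoin(url, "/robots.txt"))
    robots.read()
    if not robots.can_fetch(CRAWLER_USER_AGENT, url):
        return False

    # 2. TDMRep header: "tdm-reservation: 1" signals that text and data
    #    mining rights in this resource are reserved.
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": CRAWLER_USER_AGENT}
    )
    with urllib.request.urlopen(request) as response:
        if response.headers.get("tdm-reservation", "").strip() == "1":
            return False

    return True
```

A real crawler would need to check further signals (such as in-page meta tags and site-wide policy files), but the principle is the same: the reservation travels with the work in a form that machines can evaluate at the scale at which training data is collected.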
At the same time, this approach also means that the collective creative output of creators who do not care about such things remains available for training ML models. Unless opting out becomes the norm (which is very unlikely, given that most of the content online is produced by creators who have little incentive to do so), a sufficiently large part of the human creativity accumulated on the internet over the past three decades will remain available as training material for AI developers. This has led to a situation where many fear (in the words of Naomi Klein) that
… what we are witnessing is the wealthiest companies in history (Microsoft, Apple, Google, Meta, Amazon …) unilaterally seizing the sum total of human knowledge that exists in digital, scrapable form and walling it off inside proprietary products, many of which will take direct aim at the humans whose lifetime of labor trained the machines without giving permission or consent.
It is this appropriation of the digital commons, collectively built over the last three decades, that copyright is ill-equipped to deal with. Artist, activist, and AI entrepreneur Mat Dryhurst describes it as “a question of appropriation by means not accounted for in existing law.”[3] So how do we account for this appropriation of the “sum total of human knowledge”?
In the logic of copyright, the output of ML models does not qualify as a derivative of all the works contained in the data on which a model has been trained. But on another level, it is, of course, true that the output of a given generative ML model is very much a derivative of all the copyrighted works contained in its training data: just as the works produced by human creators embody the lived experience of their creators, the output of a generative ML model potentially draws on all the works on which the model has been trained. The “works” coming out of generative ML models exist in part because someone (or something) fed the model a particular set of training data; without access to the works used to train it, the model would not be able to produce its output. We currently have no legal framework for dealing with a situation where a synthetic work is “derived” from billions of existing works (be they copyrighted or not).
If we want to find a solution, we need to move away from the analytical framework provided by copyright, which is based on the ownership of individual works, and recognize that what generative ML models draw on is not individual works with their individual properties but collections of works of unimaginable size: state-of-the-art image generators are trained on billions of individual works, and text generators are regularly trained on more than 100 billion tokens. At this scale, removing individual works from the training data has no discernible effect on the resulting models, which contributes to the structural weakness of individual creators relative to the entities that train ML models.
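A back-of-the-envelope calculation makes this asymmetry concrete. The numbers below are illustrative assumptions matching the orders of magnitude above, not figures for any particular model:

```python
# Illustrative only: the weight of one creator's output in a training
# corpus, using assumed round numbers (no specific model is described).

image_corpus = 5_000_000_000  # images, roughly the scale of web-scraped image datasets
artist_portfolio = 1_000      # one prolific artist's complete works

print(f"one artist's share of the corpus: {artist_portfolio / image_corpus:.8%}")
# -> one artist's share of the corpus: 0.00002000%

# Even an implausibly large mass opt-out barely dents the corpus:
opted_out = 50_000_000  # 1% of all images withdrawn
print(f"corpus remaining: {(image_corpus - opted_out) / image_corpus:.0%}")
# -> corpus remaining: 99%
```

The exact figures do not matter; what matters is the ratio, which no individual decision to withdraw can meaningfully change.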
In other words, the object of this appropriation is not individual copyrighted works but rather the “sum total of human knowledge that exists in digital, scrapable form”. This is a case of the paradox of open: it is open access to this digital commons that has enabled the creation of the current crop of generative ML models, and it is at this level that we will need to address the ongoing appropriation and develop means of transferring some of the value created back to society.
Much of this digital commons consists of works that are free of copyright, openly licensed, or the product of online communities where copyright plays at best a marginal role in incentivizing their creation. This is another reason why the response to the appropriation of these digital commons cannot be based on copyright licensing: licensing would unfairly redirect the surplus to professional creators organized in collective management entities, who make up only a small subset of those who created the digital commons.
So if this appropriation is “not accounted for in existing law”, how should it be dealt with? At this stage, and given the global scope of the question, it seems doubtful that (national) laws can deal with it.
Instead, we should look for a new social contract, such as the United Nations Global Digital Compact[4], to determine how to spend the surplus generated from the digital commons. Such a social contract would require anyone who commercially deploys generative AI systems trained on large amounts of publicly available content to pay a levy. The proceeds of the levy should then support the digital commons or contribute to other efforts that benefit humanity, for example by paying into a global climate adaptation fund. A system of this kind would ensure that commercial actors who benefit disproportionately from access to the “sum total of human knowledge that exists in digital, scrapable form” can only do so on the condition that they also contribute back to the commons.