Adam Harvey on CC licensing of AI training datasets

“The Exploitation of Photography: How Creative Commons Licenses Enable Surveillance” is a new essay from Adam Harvey that builds on research we commissioned from him as part of our AI_Commons initiative. Together with Jules LaPlace, Adam has previously conducted groundbreaking research into facial recognition datasets, which they combined into Exposing.ai, an artistic intervention that lets users find themselves within these datasets. Adam’s work is one of several unique efforts to shed light on the genealogy of these datasets, on issues that, as Adam notes, are not often disclosed by either the researchers or the companies working on AI training. With his work, Adam has successfully brought the issue into the mainstream debate on the impact of digital technologies and surveillance.

As part of our collaboration, Adam studied more closely the role that Creative Commons (CC) licenses play in the creation, distribution and use of these datasets for training AI facial recognition systems. To quote Adam’s description of his aims:

This report unfolds how licenses once designed to facilitate “openness for the common good” have been misinterpreted to eventually become synonymous with a misguided “free and legal for all” logic that often ignores the legal requirements of Creative Commons.

Adam charts how CC-licensed photo databases, and in particular Flickr, became a go-to source for so-called “media in the wild,” which are praised by AI researchers. He then documents how CC-licensed content was misrepresented as “free and legal to use” by the creators of one of the largest and most significant training datasets, the Yahoo! Flickr Creative Commons 100 Million (YFCC100M) dataset from 2014. Adam notes that “the institutional sheen behind YFCC100M effectively provided a legal smokescreen that helped pave the way for a large-scale exploitation of Flickr photos”.

Adam’s research provides the necessary foundations and evidence for any conversation on this case by investigating the history and characteristics of 13 major datasets, with a focus on licensing aspects. While it is common knowledge that the case involves CC-licensed content, his work gave us more extensive empirical data on the composition of these datasets and the distribution of CC licenses within them. His research also shows how creators of datasets repeatedly declared a commitment to using openly licensed content, yet seemingly failed to properly understand and adhere to the licensing conditions. The lack of legal expertise at different stages of the creation and distribution of these datasets is one of the most significant, and also surprising, findings from Adam’s work.

Adam identifies several key issues related to the potential misuse of CC-licensed photographs for AI training: commercial use of non-commercial images, non-consensual use of biometric data and lack of attribution. These specific issues should at some point be considered by legal experts dealing with these cases. Such conversations were already initiated at the last Creative Commons Summit, during the session titled “How Can Open Sharing and Privacy Coexist?”.

At the same time, the detailed analyses of the datasets raise questions about their governance. The data analysis produced by Adam confirms our hypothesis that there is a significant policy vacuum around AI training with open datasets. We previously investigated this issue at MozFest 2022 by exploring how design-led solutions could better protect users’ biometric data in the context of open sharing. The issue concerns not just creators and owners of the datasets, but also creators and owners of CC-licensed content, and other entities that collectively shape the space of open licensing and open sharing.

The study is currently published as a draft, with a request for comments and feedback.

In June, we will publish a whitepaper that combines our analysis of the data produced by Adam with insights from a survey of photographers who publish photos of faces under CC licenses, which we conducted together with Selkie Research. The whitepaper will serve as input for a series of workshops in which we will discuss the issues raised by using openly licensed photos for AI training with stakeholders from across the open licensing ecosystem.