We are publishing today, as a request for comments, a white paper on understanding the implications of face recognition training with CC-licensed photographs. This is an outcome of research work that we initiated last year. We chose this case, as it offered an opportunity to address a more general argument about challenges to open sharing, which we formulated in our essay on the Paradox of Open.
The starting point for this case is a series of incidents concerning the use of CC-licensed photographs of people, published predominantly on Flickr, in datasets that were later used for training of face recognition systems. In 2019, Adam Harvey published research on MegaFace, a dataset that contained 3 million CC-licensed photographs, used most probably without necessary consent from portrayed individuals. The MegaFace case became an example of the tension between the open sharing of photographs of people – with tools like the Creative Commons licenses – and potential harms, mainly related to privacy violations and extractive use of personal data. (As part of our research we also commissioned a study on CC licensing of these datasets from Adam Harvey).
In 2022, the AI training datasets built with CC-licensed photographs (Adam’s research provides details on almost a dozen of them) are still in use. Over the years, these datasets were used to train facial recognition models that were later used in hundreds of projects, including the development of military technologies or surveillance solutions.
The case of these datasets is important for the AI research community, which aims to address potential harms resulting from AI technologies and to ensure that not just their use but also their development is ethical and responsible. There is an ongoing debate on the governance of AI training datasets, to which we are contributing with the white paper that we are publishing today.
But this case is just as important for advocates of open licensing and sharing. Our white paper identifies challenges that largely fall beyond the copyright system, but which nevertheless need to be addressed by those advocating for the open sharing of data and content. This requires looking at open sharing from the perspective of its impact on digital rights on the one hand, and of issues related to research and technology ethics on the other.
The AI_Commons case is also relevant, and fascinating, because it relates to technological challenges that are deeply felt in today’s zeitgeist. The risks related to AI technologies are an obvious context for this case. But it is also a case about the consequences of works being used on a massive scale, a scale that until recently was hard to imagine for those who share them. Re-use on this scale tests the limits of practical applicability of some of the cornerstones of open licensing, like attribution. Finally, this is a case about our faces, as they increasingly become a resource that is being extracted and that has to be protected.
The challenges raised by these datasets can by now be seen as largely historic. YFCC100M, the oldest of the face recognition training datasets built with CC-licensed photographs, is almost a decade old. We therefore see our work not so much as finding solutions to the specific issues that we have analysed, but rather as input that helps us become better equipped to master challenges in the future. And indeed there are new developments related to openness and AI technologies that go beyond the matter of this case: new licensing approaches aim to reconcile openness and responsible use, and new datasets are built by simply scraping web content.
Nevertheless, we believe that there are useful lessons to be learned from the AI_Commons case. For the AI research community, the case gives clues to better governance of AI training datasets. For open advocates, it provides an opportunity to review open licensing frameworks and to make them future-proof.
We are currently soliciting feedback on our white paper and are in particular interested in:
We invite you to share feedback, comments and criticism directly in the PubPub publication or by writing to Alek Tarkowski (email@example.com). As the next step, we will be organizing a series of conversations – please fill out this form if you would like to contribute to the discussion.