Filling the policy vacuum on AI training with open datasets

The use of openly licensed photos of faces to train AI facial recognition systems has been raised in recent years as one of the controversial use cases for Creative Commons licensed content.

Since the case received media attention in 2019, it has often been raised as an example of an inherent conflict between openness and privacy, and of the extraction of value from the commons by corporations.

In the background, there are growing concerns about the ethics of artificial intelligence and machine learning technologies, especially in relation to biometric data.

We launched the AI_Commons research activity to find a solution to this issue. By studying this case, we hope to better define how governance of shared resources can balance open sharing with the protection of personal data and privacy. We also see this as a case that concerns the irrevocability of CC licenses and their unintended uses, and thus the challenge of making the CC licensing stack future-proof.

Finally, this is a case that explores the limits of the Open Access Commons approach to sharing. We are exploring whether some types of data need a stronger, more managed commons and data governance.

This initiative is part of our work on Data Commons and is also a key case illustrating the Paradox of Open.

We are running this research in collaboration with Adam Harvey, an artist and technologist from the project. We commissioned Adam to conduct an independent study of CC licensing in the context of datasets for AI facial recognition training. You can read the report “The Exploitation of Photography: How Creative Commons Licenses Enable Surveillance” on Adam Harvey’s webpage.



Use of openly licensed photographs and machine learning: summary of survey results
A research report presenting the results of a survey conducted by Selkie Study as part of our AI_Commons initiative. The survey allowed us to gather insights from users of photo-sharing platforms on the use of content that they shared openly for AI training. The main objective of this study was to identify possible points of controversy around the use of open content for the development of AI technologies. It also enabled us to outline directions for further research and for supporting users in understanding and reacting to incidents related to the development of AI technologies.
Notes on BLOOM, RAIL, and openness of AI
The launch of BLOOM, an open language model capable of generating text, and the related RAIL open licenses by BigScience, together with the launch of Stable Diffusion, a text-to-image model, shows that a new approach to open licensing is emerging. In Notes on BLOOM, RAIL, and openness of AI, Alek outlines the challenges to established ways of understanding open faced by AI researchers, as they aim to enforce their vision of not just open, but also responsible AI.