Exploring commons-based approaches to machine learning
The release of powerful machine learning models under open licenses was a major event in the AI/ML development space in 2022. Until then, large generative models such as GPT-3 and Dall-E were seen as a force that would concentrate digital power in the hands of a few corporations. The release of the Stable Diffusion image generation model (and other models like BLOOM and Whisper) marked a significant change.
This was a breakthrough moment for the world of open, signaling the emergence of a new field in which the principles of open are applied. The field is still nascent, with no established norms for openly sharing the different elements of the machine learning stack: data, models, and code. At the same time, a new sharing norm has emerged, expressed in the suite of RAIL licenses, which aim to combine an open licensing model with rules for responsible use.
By early 2023, it became clear that the emergence of generative AI would re-ignite copyright debates, which free culture and access to knowledge advocates had been involved in for the past two decades. Until then, public discussion about the potential harms of AI systems had focused on issues such as bias, disinformation and threats to privacy. Now, the list must include the issue of creators’ rights and rules for the reuse of creative works. This is a conversation that is familiar to open movement activists, but one that needs to move beyond its traditional framing. It is essential to understand how to balance creators’ and users’ rights in a context where creation is automated and reuse occurs in new ways.
Our research seeks to contribute to this public debate and to the emerging field of open and commons-based approaches to machine learning. We are particularly interested in the commons-based governance of datasets and models, the impact of generative AI on creativity, and the emergence of new licensing models that balance openness and responsible use.
We agree with Widder, West, and Whittaker that openness alone will not democratize AI. However, it is clear to us that any alternative to current Big Tech-driven AI must be, among other things, open.
In this article, Open Future fellow Nadia Nadesan shares learnings from facilitating a citizens' assembly with Algorights to explore local participation in shaping the AI Act.
In this analysis, I review the Llama 2 release strategy and show its non-compliance with the open-source standard. Furthermore, I explain how this case demonstrates the need for more robust governance that mandates training data transparency.
Today, OpenAI announced that GPTBot, the web crawler used to collect training data for its GPT series of large language models, can now be blocked via the robots.txt protocol. Site administrators can either disallow crawling of entire sites or create custom rules that allow `GPTBot` access to some parts of a site while blocking it from others. This functionality gives site owners a level of control over how their content is used by OpenAI's LLMs that they previously lacked.
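For illustration, here is roughly what such rules look like in a site's robots.txt; the directory names are placeholders, not paths from OpenAI's documentation:

```
# Block OpenAI's crawler from the whole site:
User-agent: GPTBot
Disallow: /

# Or, instead of the rule above, allow some paths and block others
# (the paths below are placeholders):
#
# User-agent: GPTBot
# Allow: /blog/
# Disallow: /private/
```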
At first glance, OpenAI's approach follows the opt-out mechanism established by the TDM exceptions in the EU copyright framework. On closer inspection, however, the model- and vendor-specific nature of this approach raises more questions than it answers. It implies that website publishers are responsible for setting rules for each individual ML training crawler operating on the web, rather than setting default permissions that apply to all of them.
The AI Act should allow for proportional obligations in the case of open source projects, while creating strong guardrails to ensure that they are not exploited to evade legitimate regulatory scrutiny.
The AI Act should clarify the criteria, including revenue, that determine whether a project has crossed the “commercialization” threshold.
In a third recommendation, Mozilla highlights the importance of definitional clarity when it comes to regulating open source AI systems. Here, Mozilla suggests maintaining a strict definition (one that would exclude newer licenses such as the RAIL family) and clarifying which components would need to be released under an open license for a system to be considered an open source AI system. According to Mozilla, this should indicatively apply to models, weights, and training data.
Today, together with Hugging Face, Eleuther.ai, LAION, GitHub, and Creative Commons, we are publishing a statement on Supporting Open Source and Open Science in the EU AI Act. We strongly believe that open source and open science are the building blocks of trustworthy AI and should be promoted in the EU.
This article examines an example from the global women's rights movement of how organizations and institutions support local actors to participate in transnational AI governance and challenge top-down structures and mechanisms.
There has been a lot of attention on copyright and generative AI/ML over the last few months. In this essay, I propose a two-fold strategy to tackle this situation. First, it is essential to guarantee that individual creators can opt out of having their works used in AI training. Second, we should implement a levy that redirects a portion of the surplus from training AI on humanity's collective creativity back to the commons.
Today, the European Parliament's IMCO and LIBE committees adopted their joint report on the proposed AI Act. The text includes additional safeguards for fundamental rights and an overall more cautious approach to AI. In this post, we provide an in-depth analysis of the implications of the text for open source AI development.
AI governance covers a wide range of processes and conversations, from companies' internal governance policies to public, national, and transnational regulatory bodies. In this blog series, I intend to map the places where friction occurs in this seemingly effortless and inevitable flow of technology.
The following piece is the first part of a case study on how Wikipedia is positioned to address the challenges of open AI development. It spells out the general argument, which will be followed by more specific suggestions on what a wikiAI mission could look like.
Establishing a regulatory framework that achieves the dual objectives of protecting open source AI systems and mitigating risks of potential harm is a critical imperative for the European Union, especially since open source, publicly supported AI systems are crucial digital public infrastructure that would help ensure Europe’s sovereignty.
Clément Perarnaud on the role of standards in making the AI Act operational
The European Union's upcoming AI Act will require adequate standards to become fully operational, and much work is required to ensure that the standardization process does not conflict with the Act's inclusion and transparency objectives.
The process will be led by the European Committee for Standardization (CEN) and the European Committee for Electrotechnical Standardization (CENELEC), bodies that have in the past been criticized for their lack of transparency. The standards must be made public, but some fear that the private sector will have too much control over the process, which could have an impact on human rights. The standards' nature and scope will also have geopolitical implications, with some calling for greater international cooperation.
Standards will be essential in enforcing the EU's AI legislation, and CEN-CENELEC will have just two years to formulate and agree on a series of AI standards.
The LAION proposal calls for a public research facility capable of building large-scale artificial intelligence models. It offers an alternative to corporate AI development, one in which responsible use in open source environments is ensured through the involvement of democratically elected institutions.
The rapid advancement of AI challenges the concept of openness on the internet. Companies use publicly available data to their advantage, frequently disregarding the concerns and welfare of other parties, such as artists and content creators, as well as the impacts of the tools they make available.
The Future of Life Institute published an open letter asking for a moratorium on generative AI development. Yet social harms caused by AI will not be addressed in this way. Instead, commons-based governance of existing AI systems is needed.
Helberger and Diakopoulos on the AI Act and ChatGPT
Natali Helberger and Nicholas Diakopoulos have published an article titled "ChatGPT and the AI Act" in the Internet Policy Review. The article argues that the AI Act’s risk-based approach is not suitable for regulating generative AI due to two characteristics of such systems: their scale and broad context of use. These characteristics make it challenging to regulate them on the basis of clear distinctions between risk and no-risk categories.
The article is relevant to us in the context of open source, general-purpose AI systems, and their potential regulation.
Helberger and Diakopoulos propose looking for inspiration in the Digital Services Act (DSA), which lays down obligations on mitigating systemic risks. A similar argument was made by Philipp Hacker, Andreas Engel, and Theresa List in their analysis.
Interestingly, the authors also point out that providers of generative AI models are currently making efforts to define risky or prohibited uses through contractual clauses. While they argue that “a complex system of private ordering could defy the broader purpose of the AI Act to promote legal certainty, foreseeability, and standardisation,” it is worth considering how regulation and private ordering (through RAIL licenses, which we previously analyzed) can contribute to the overall governance of these models.
Spawning.ai announces that it has collected opt-out requests for 80 million artworks.
According to the announcement, 40,000+ individual artworks have been opted out from use for ML training via the haveibeentrained.com tool. The remaining 79 million+ opt-outs were registered through partnerships with platforms (such as ArtStation) and large rightholders (such as Shutterstock).
These opt-outs are for images included in the LAION 5B dataset used to train the Stable Diffusion text-to-image model. Stability AI has announced that the opt-outs collected by spawning.ai and made available via an API will be respected in the upcoming training of Stable Diffusion V3.
As we have previously argued, such opt-outs are supported by the EU's legal framework for machine learning, which allows rights holders to reserve their rights over text and data mining carried out for all purposes except academic research undertaken by research institutions. Spawning.ai is the first large-scale initiative to leverage this framework to offer creators and other rights holders the ability to exclude their works from being used for machine learning training.
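To make the mechanism concrete, here is a minimal sketch of how a training pipeline might filter records against an opt-out registry before training. The `fetch_opted_out_urls` lookup is a hypothetical stand-in, not Spawning.ai's actual API, and the URLs are placeholders:

```python
# Minimal sketch: drop opted-out works from a training set before training.
# `fetch_opted_out_urls` stands in for a real opt-out service; the registry
# contents and record format here are hypothetical.

from typing import Iterable, Set


def fetch_opted_out_urls(batch: Iterable[str]) -> Set[str]:
    """Return the subset of `batch` that rights holders have opted out.

    In a real pipeline this would query an opt-out API; here it is a
    placeholder so the example stays self-contained.
    """
    registry = {"https://example.org/artwork-123.jpg"}  # stand-in registry
    return {url for url in batch if url in registry}


def filter_training_records(records: list[dict]) -> list[dict]:
    """Keep only records whose image URL is not in the opt-out registry."""
    opted_out = fetch_opted_out_urls(r["url"] for r in records)
    return [r for r in records if r["url"] not in opted_out]


if __name__ == "__main__":
    sample = [
        {"url": "https://example.org/artwork-123.jpg", "caption": "opted out"},
        {"url": "https://example.org/artwork-456.jpg", "caption": "kept"},
    ]
    print(filter_training_records(sample))
```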
As generative machine learning (ML) becomes more widespread, the issue of copyright and ML input is back in focus. This post explores the EU legal framework governing the use of copyrighted works for training ML systems and the potential for collective action by artists and creators.
The Collective Intelligence Project has published a new working paper by Saffron Huang and Divya Siddarth that discusses the impact of Generative Foundation Models (GFMs) on the digital commons. One of the key concerns raised by the authors is that GFMs are largely extractive in their relationship to the Digital Commons:
> The dependence of GFMs on digital commons has economic implications: much of the value comes from the commons, but the profits of the models and their applications may be disproportionately captured by those creating GFMs and associated products, rather than going back into enriching the commons. Some of the trained models have been open-sourced, some are available through paid APIs (such as OpenAI’s GPT-3 and other models), but many are proprietary and commercialized. It is likely that users will capture economic surplus from using GFM products, and some of them will have contributed to the commons, but there is still a question of whether there are obligations to directly compensate either the commons or those who contributed to it.
In response, the paper identifies three proposals for dealing with the risks that GFMs pose to the commons. Read the full paper here.
The RAIL licenses are gaining ground, but permissive sharing remains the dominant norm governing the sharing of ML models on huggingface.co. This analysis aims to understand how licenses are used by developers who make ML model-related code and/or data publicly available.
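As a rough illustration of how such an analysis could be approached (not a description of the original methodology), the sketch below uses the `huggingface_hub` client library and assumes that licenses appear as `license:<id>` tags on each model listing; the sample size and tag convention are assumptions:

```python
# Rough sketch: tally license usage across a sample of models on huggingface.co,
# assuming licenses are exposed as "license:<id>" tags on each model listing.

from collections import Counter
from huggingface_hub import HfApi


def count_licenses(sample_size: int = 1000) -> Counter:
    """Count license tags across a sample of models listed on the Hub."""
    api = HfApi()
    counts: Counter = Counter()
    for model in api.list_models(limit=sample_size, full=True):
        licenses = [
            tag.split(":", 1)[1]
            for tag in (model.tags or [])
            if tag.startswith("license:")
        ]
        # Models without a license tag are counted as "unspecified".
        counts.update(licenses or ["unspecified"])
    return counts


if __name__ == "__main__":
    for license_id, n in count_licenses().most_common(10):
        print(f"{license_id}: {n}")
```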
So far, none of the approaches to open source AI systems in the AI Act address concerns about chilling effects on open source AI development. The Parliament still has the opportunity to address these concerns without jeopardizing the AI Act’s overall regulatory objective by leveraging the inherent transparency of open source, writes Paul Keller.
The launch of BLOOM, an open language model capable of generating text, and the related RAIL open licenses by BigScience, together with the launch of Stable Diffusion, a text-to-image model, shows that a new approach to open licensing is emerging. In Notes on BLOOM, RAIL, and openness of AI, Alek outlines how established ways of understanding openness are challenged as AI researchers aim to enforce their vision of not just open, but also responsible, AI.
Instead of analyzing the functioning of image generators through the lens of copyright, we should ask ourselves a normative question: Why should we want copyright to apply to the visual output of these generators?