In mid-July, Meta released Llama 2, the next generation of its large language models (LLMs). The company described the model release as “open source, available for free and for commercial use.”
Machine learning models and their components have been open-sourced for years by organizations like Eleuther.ai. Last year, with the launch of models such as Stable Diffusion and BLOOM, open-source AI gained greater visibility. The field of AI development, previously dominated by a few companies building closed technologies, is now portrayed as one in which closed and open solutions compete. In May, a leaked internal Google memo claimed that there is “no moat” around closed-source AI development and that open-source solutions will out-compete companies like Google or OpenAI.
In Europe, in the final stages of legislative work on the AI Act, specific rules are being defined for open-source foundation AI models. These rules validate the open-source approach to AI development and recognize that open-source models can serve market competition, research and innovation, and the public interest.
As the rules get written, one challenge is building sufficient guardrails against corporate “open washing”: releasing resources that appear to be open source without adhering to the norms of open-source development.
In this analysis, I review Llama 2’s release strategy and show that it does not comply with the open-source standard. I also argue that the case demonstrates the need for more robust governance that mandates training data transparency.
In this context, the release of Llama 2 was a critical moment: a major, commercial LLM was openly released but proved non-compliant with the existing open-source standard. This analysis looks at the Llama 2 release in more detail to determine how it differs from standard open-source approaches and whether it is a case of open washing. It is relevant to the regulatory work on the AI Act and its exception for open-source AI development. The case should also inform grassroots efforts to define governance standards, such as the OSI’s AI Deep Dive process – which aims to review the open source definition – and the Digital Public Goods Alliance’s work on defining AI systems as digital public goods.
This analysis is based on the assumption that the open-source definition is not set in stone. So while most analyses of the Llama 2 release (such as this piece from Charlie Hull at Open Source Connections) have focused on the fact that it is not compliant with the open source standard, this analysis also considers whether any elements of this release are relevant for the exploration of new “open(ish)” approaches to sharing LLMs.
The large corporations developing the leading generative AI models have been doing so with a closed-source approach. Historically, these companies have contributed to the open-source development of key machine learning technologies (such as the TensorFlow library) and have even released early models as open source (such as GPT-2). Yet they have refrained from open-sourcing their newest models. Emblematically, OpenAI has shifted from open to closed source, in a move pitched by the company’s executives as one made in the name of responsible AI development. Open-source models, on the other hand, are being created by new market entrants, non-profits (such as Eleuther.ai), or state-backed actors (such as the UAE institute behind Falcon).
Meta’s release of Llama 2 complicates this picture and signals that the company has adopted a different approach than its competitors. The fact that the release was accompanied by a “Statement of Support for Meta’s Open Approach to Today’s AI” suggests that Meta feels pressure to explain its approach in a field where closed-source development is the norm, justified by principles of responsible development.
And as European legislators consider rules that make open source AI systems exempt from obligations introduced by the AI Act, there is a need for “regulation that treats open source to foster innovation without providing a pink slip to necessary regulatory compliance” – as the issue is framed in a recent Mozilla position.
The case of Llama 2, a corporate model dubbed “open source” despite not adhering to the open source standard, offers an excellent opportunity to “red team” the emerging governance model for open-source AI.
The release notes (including the Llama 2 webpage and the white paper detailing the model) describe Llama 2 using various terms: open, open source, “available for free for research and commercial use.” The white paper uses the term “Responsible Release Strategy” to highlight that the open release of Llama 2 is meant to encourage responsible AI innovation. Looking in more detail at the licensing agreement and other elements of Meta’s release approach will help us understand whether this is a case of “open washing.”
Critics have portrayed Llama 2’s release as a case of open washing. Indeed, using the term “open source” in the release documents is misleading since the license is not compliant with the open source definition. At the same time, the additional conditions introduced in the Llama license are worth reviewing in the context of an ongoing debate about open-source AI development.
Firstly, introducing use restrictions aligns with a “responsible AI development” approach that aims to balance open research with AI ethics. This approach, championed by the RAIL initiative and offering solutions to some of the perceived problems of open-source development, has not gained much popularity. The use of a RAIL-like license by Meta to release Llama 2 might be a significant boost for this approach. And while use restrictions are not compliant with the current open-source standard, they should be seen as a potential addition to any updated standard.
Secondly, the fact that the license excludes Meta’s key commercial competitors – companies with more than 700 million monthly active users, which must request a separate license from Meta – is clearly an anti-competitive measure: the limitation targets a narrow group of Big Tech companies. This kind of provision resembles measures meant to address the Paradox of Open: the fact that digital public goods have in the past been exploited (mainly by Big Tech companies) without anything being given back to the commons. The type of cap introduced in the Llama 2 license is a way to limit such exploitation – but it does not serve this purpose when it is used to protect the assets of a Big Tech company like Meta. Such a cap should instead be adopted by smaller entities aiming to protect the commons from exploitation by the largest actors. And this approach will be rejected by many open-source advocates, who believe that licenses should be agnostic about the types of users who benefit from them.
Thirdly, there is the limitation on using Llama 2 and its outputs to develop other models. In recent months, many projects have taken an open-source model and fine-tuned it with data generated by a commercial, more capable model (primarily GPT-3, GPT-4, or ChatGPT). Stanford’s Alpaca project initiated this trend, and more than a dozen projects have since combined data from OpenAI models with the first-generation Llama model to train new models for specific domains or chat services. While the research community holds mixed opinions about the outcomes of such fine-tuning projects, the method could be used to bootstrap a new, better-performing model off one of the Llama 2 models. There are also other uses of data from high-performing models – for example, for reinforcement learning to further refine other AI models. This licensing condition is therefore not just non-compliant with the open source standard but – even more importantly – will harm further AI research into open-source models.
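To illustrate what this condition rules out, below is a minimal, hypothetical sketch of the Alpaca-style approach: prompting a stronger commercial model for answers and saving the resulting instruction–response pairs as fine-tuning data for an openly released base model. The seed prompts, model name, and file path are illustrative only, and the snippet assumes access to OpenAI’s hosted API via its Python client.

```python
# Hypothetical sketch of Alpaca-style data generation: a stronger, hosted model
# answers seed prompts, and the pairs are stored as fine-tuning data.
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative seed tasks; real projects generate thousands of instructions.
SEED_TASKS = [
    "Explain the difference between an open-source license and a proprietary one.",
    "Summarise the main goals of the EU AI Act in three sentences.",
]

records = []
for task in SEED_TASKS:
    # Ask the commercial model for a reference answer to each instruction.
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": task}],
    )
    records.append(
        {"instruction": task, "output": completion.choices[0].message.content}
    )

# The resulting pairs would then be used to fine-tune an openly released base
# model (e.g. with the Hugging Face `transformers` Trainer or a LoRA setup).
with open("synthetic_instructions.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

Under the Llama 2 license, running an equivalent pipeline with a Llama 2 model in place of the commercial one – and using its outputs to improve a non-Llama model – is not permitted.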
The rules specified in the Llama 2 license support collaborative development only within the bounds of the Llama 2 ecosystem. In this way, Meta is trying to capture the benefits of open source – its advantages for research and innovation – while building its own “moat” against competitors.
The Llama 2 white paper offers a detailed view of how the model was created, focusing on the training and fine-tuning processes and the work on model safety. Against this background, the lack of information about the training data is telling. There is just a vague statement that the model was “trained on a new mix of publicly available data.”
This is in stark contrast to the original Llama model, whose white paper lists the sources of training data. It is also clearly a sign of the times: in just a few months, the issue of training data sourcing has become highly controversial, with OpenAI infamously releasing GPT-4 without disclosing any information about its training data. As our research on a decade of face recognition training data has shown, AI researchers have long adopted an overly lax attitude to the sourcing and use of training data.
The lack of information about Llama 2’s training data suggests that Meta currently considers that data a potential liability. It is also a significant limitation of the model itself, constraining its further development. And it is telling that an approach described as focused on “responsible release” – one that thoroughly outlines red-teaming work and the environmental impact of the model’s training – treats data governance as an issue that a short, oblique statement can summarise.
This case shows that any new standard for the openness of AI development should include rules on training data transparency. Traditionally, open-source development has been agnostic about the data used in open-source systems. The importance of training data for model development requires reconsidering this approach.
Today, the fact that Llama 2’s creators withhold information about the training data is not what disqualifies the model from being defined as “open source” (as I argued above, it is the anti-competitive clauses that make the model non-compliant). At the same time, documentation of training data is often cited as a voluntary practice that distinguishes open-source model development. It is time for the open-source community to take a stronger position in this regard.
A recent policy position from Mozilla proposes that a standard for open-source AI development could include a requirement to openly release training data. While this might be an overly strong standard (model training can also rely on legal exceptions to use data that is not openly licensed), it is a step in the right direction in acknowledging the importance of data governance.
As the recent position paper on “Supporting Open Source and Open Science in the EU AI Act” (to which Open Future contributed) states, regulatory support for open-source AI development can foster open science, market competition, and innovation. But to do so, a standard for open-source AI needs to ensure that these goals, and the public interest, are secured. The analysis of the Llama 2 release confirms this need – especially since the open-source status of a given AI system will determine how it is regulated in Europe under the AI Act.
The term “open source” should not be stretched to cover releases that do not comply with such a standard (or with the existing, general standard embodied in the Open Source Definition). Meta should therefore not use the term “open source” for a model whose license fundamentally breaks with open-source norms – as Llama 2’s anti-competitive clauses do.
Yet the case also confirms the need to revisit the open source standard, either by modifying the definition itself or by establishing additional protocols specific to AI development. Crucially, these should establish norms for training data transparency. Regulators might introduce such norms – although the recent agreement between the White House and American AI companies suggests a reluctance to do so. For this reason, the open-source community should also define a norm of its own – even if questions about data provenance and transparency fall beyond the scope of the traditional open-source approach. The latest discussion paper from the Digital Public Goods Alliance offers guidance: it proposes that the digital public goods standard, when applied to AI systems, require open licensing of training data, and it argues for further work on defining a data governance standard for AI systems.
While the Llama 2 release has significant limitations and faults when reviewed against the emerging principles for open-source AI development, it also includes novel mechanisms, such as the attempt to build community-based governance. Hopefully, Meta will commit – in the spirit of such governance – to supporting a collaborative process aimed at defining a standard for open-source AI releases, and to making its future release strategies compliant with that standard.
I would like to thank Hailey Schoelkopf from Eleuther.ai and Paul Keller for valuable feedback.