On Monday, the Open Source Initiative released version 1.0 of the Open Source AI Definition (OSAID). The definition is an important step in setting a standard for openness in AI development. Its main value is a clear stance on what does not constitute an open source AI system. At the same time, some questions, especially on data sharing requirements, remain contentious. The definition should be the starting point of a broader standard-setting effort, one that goes beyond the perspective of open source AI developers.
The term “open source AI” appears often in policy debate, but its meaning has until now been a matter of opinion. In this context, the OSAID – which carries a lot of weight thanks to the status that the OSI enjoys in the open source ecosystem – marks an essential step towards a shared understanding of what openness means in the context of developing AI models and systems. In this analysis, we describe the importance of defining open source AI, look at the limitations of the definition – in particular, the controversies related to its data provisions – and situate it in the broader context of efforts to establish a community norm for openness in AI.
A standard for openly developed and shared AI systems is needed today because these are increasingly seen as viable alternatives to the dominant, closed AI systems. So far, much of the policy debate around so-called “open source AI” has focused on the potential risks of the uncontrolled spread of AI technologies released under permissive licenses. Over the last year, the tone of the debate has shifted, with analyses demonstrating the value proposition of openly shared AI systems and their role in combating concentrations of power in AI.
Some commercial AI developers are also releasing “open” solutions, although many of these efforts should be considered openwashing. Some companies – OpenAI and, more recently, Mistral – initially committed to openness, only to enter a path of increasing closure and obfuscation. Emblematically, Meta pursues an “open source” strategy that does make its models openly available, but under conditions that significantly curb reuse and without any transparency about the training data. Because of these developments, there is a need for a meaningful standard of openness in the context of AI development.
Over the last year, multiple initiatives have contributed to defining such a standard. The Linux Foundation published its Model Openness Framework, a classification system that defines three tiers for ranking machine learning models based on their completeness and openness. Mozilla proposed a Framework for Openness in Foundation Models, intending to provide a nuanced understanding of openness that will support further work around definitions of openness in AI. The Digital Public Goods Alliance has convened a Community of Practice on AI as a Digital Public Good, which recently published its recommendations. There is ongoing work by multiple other organizations, including the Open Knowledge Foundation and CNRS, the Open Data Charter, and GovLab’s Open Data Policy Lab, to name just a few.
The efforts of the Open Source Initiative have a special status among these, as the OSI is the steward of the Open Source Definition – and is therefore well positioned to propose how open source development should be understood in the context of AI. This is especially true given that many of the creators of open AI systems consider themselves open source developers and have adopted the term.
As various initiatives started exploring openness in AI, it became clear that this is not just a matter of applying a framework developed for software code to AI systems. The new standard needed to encompass the complex mix of training data, various pieces of software code, model weights, and documentation. For some of these components, well-developed frameworks exist for open sharing. For others – model weights being the prime example – there is still insufficient understanding of their copyright status and, thus, of the ways in which they can be openly shared.
The complexity of AI systems makes the challenge of defining a standard of openness exciting, as it can build on the experiences of various fields of open: Open Source, Open Science, and Open Data. As almost any type of resource is today seen by AI developers as a potential training dataset, the issue is relevant for actors in all these fields, including Open Access, Open Education, and Open Culture. This means that the Open Source perspective is a good starting point, but it is insufficient to address the issue entirely.
The starting point for the OSAID was the same one that underlies the original Open Source Definition: the need to give developers certain possibilities to access and reuse code (the Free Software Definition frames these in terms of four user freedoms). In the original definition, the OSI used the term “preferred form in which a programmer would modify the program” to describe what needs to be shared to secure these rights. The definition for software is built on a years-long practice of providing source code under an open license.
The Open Source Definition therefore describes the conditions that a software license must meet to enable the sharing of software in this preferred form. Building on this definition, the OSI has focused for more than two decades on determining license compliance. The goal for AI systems is much more complex: the standard needs to define an AI system, its components, and its preferred forms for making modifications. Ultimately, this requires a compliance process that concerns not only the licenses used to share AI systems and their components but also the individual systems themselves.
Therefore, a standard for openness in AI needs first to define what an AI system is and what its components are. The OSI has used the definitions of an AI system and of machine learning established by the OECD. Regarding the AI system’s components, the definition builds on the work of the Linux Foundation and Mozilla, which both released detailed analyses of the AI stack. Generally speaking, there is now consensus that the key elements of AI systems are training data, code for training and running the system, and model parameters.
In broad terms, the definition states that an AI system is open source – that is, it is made available in the form most preferred for making modifications that fulfill the four user freedoms (to use, study, modify, and share the system) – if all of the following are available under terms that grant these freedoms:

- sufficiently detailed information about the data used to train the system (“data information”);
- the complete source code used to train and run the system; and
- the model parameters.
Several elements of this definition are worth noting, and we will address them in turn.
For starters, the OSAID leaves many issues related to open licensing in the AI space unanswered. It takes a conservative approach that does not venture beyond well-established rules for sharing code and does not clarify the more complex issues. Most importantly, it avoids providing a standard for sharing model weights, citing a lack of legal clarity on the matter. The OSI has, however, taken a position – albeit not an explicit one – on the new licenses developed in response to the emergence of the current generation of AI models, most prominently the so-called responsible AI licenses. These licenses are not OSI-compliant, as they discriminate against certain forms of use, which means that systems released under them do not comply with the new definition. Again, this conservative position ignores the reality that a substantial proportion of permissively shared AI models and other components are released under such licenses. While they probably do not meet the original definition of open source, an ongoing debate suggests that, even with additional restrictions, this can be a meaningful form of openness in AI.
More importantly, the OSI decided that the data used to train an AI system does not need to be made available under an open license for the system to be considered open source. The decision was based on input from a working group of volunteer model developers, who declared that the data itself is not the preferred form for making modifications to the system. The OSI also argues that an open data sharing requirement would unnecessarily limit the applicability of the OSAID, since many datasets cannot be shared (for example, due to privacy concerns) but can still form the basis for systems that are shareable as a whole. This choice has proved to be both pragmatic and contentious.
With regards to training data, the OSAID instead requires detailed “data information” about the training datasets to be provided:
“(1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies; (2) a listing of all publicly available training data and where to obtain it; and (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.”
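To make this concrete, the sketch below shows what such data information could look like if captured in a machine-readable form. The structure, field names, and example values are our own hypothetical illustration for an imaginary model; the OSAID itself prescribes no particular format.

```python
# A hypothetical, machine-readable rendering of the OSAID's three
# "data information" requirements. All names and values are invented
# for illustration; the OSAID does not mandate any specific format.
data_information = {
    # (1) Complete description of all training data, including unshareable data
    "description": {
        "provenance": "Web crawl (2022-2023) plus a licensed news archive",
        "scope_and_characteristics": "~1.2T tokens, 30 languages, text only",
        "selection_criteria": "Deduplicated; quality-filtered by a classifier",
        "labeling_procedures": "Preference labels from contracted annotators",
        "processing_and_filtering": "PII scrubbing, toxicity filtering",
    },
    # (2) Publicly available training data and where to obtain it
    "public_data": [
        {"name": "example-web-corpus", "url": "https://example.org/corpus"},
    ],
    # (3) Training data obtainable from third parties, including for fee
    "third_party_data": [
        {"name": "example-news-archive", "source": "Example Media Inc.",
         "terms": "paid license"},
    ],
}
```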
This is a much weaker stance than a requirement to release all training data under an open license would have been. It is nevertheless a progressive position in broader policy debates and, as a result, an important attempt at norm-setting. The OSAID not only sets a strong standard for developers who want to comply but also makes it easy to identify cases of corporate openwashing. Today, there is a visible trend towards less transparency about training data, driven by competitive concerns (over access to proprietary data sources) and by concerns about exposure to copyright liability (related mainly to web-scraped data).
Working against this trend are efforts to establish transparency requirements for all AI systems, such as the training data transparency requirement in the European Union’s AI Act. Here, the OSAID requirements can serve as an important reference point that lends credence to efforts to ensure meaningful transparency, including the blueprint for an AI Act transparency template that we have developed together with Mozilla and several open source AI developers.
There are limits to training AI with open resources alone. Their volume and diversity are insufficient to successfully train large foundation models, and there are few avenues for making more data and content available under open licenses anytime soon. For this reason, almost all open model development efforts depend heavily on publicly available data that has been web-scraped.
From the perspective of AI developers, this makes the lack of an open data sharing requirement in the OSAID the correct choice. First, such a requirement would constrain developers, tying them to a resource that is not sufficient for their needs. Second, as they argued in the process of developing the definition, their preferred modes of modifying AI systems do not depend on the openness of data.
The issue becomes contentious once a different perspective is introduced: that of Open Data advocates and, more broadly, stakeholders whose work focuses on various types of free and open content rather than software development (although the OSI’s decision has also been contested by some open source developers). These stakeholders are less focused on the specific modalities of using data for AI training. Securing the openness of various types of data and content is, for them, a goal in itself – and from this perspective, the lack of an Open Data requirement in the OSAID is a form of openwashing: a signal that openness of AI is possible without openness of the data used to train these systems.
Those who support an open data requirement also believe that, with such a requirement, the OSAID would encourage more developers to share data. While there is consensus that greater availability of data is beneficial, for some the purpose is solely that of training AI systems; for others, there are broader societal goals to be achieved with freely shared knowledge. The question is whether the OSAID – designed as a standard that secures developer freedoms – is the right tool for securing more openness. This largely depends on whether there is an “openness gap” in data: are there ample resources that could be shared openly but today are not?
Ultimately, the choice regarding a data sharing requirement is between a stronger but narrower standard that secures various forms of openness of resources related to AI development and a weaker but broader standard that acknowledges that AI training is not done with open data alone. The OSI has made the latter – probably more pragmatic – choice.
The fact that such a choice exists means that the issue would be easiest to solve with a tiered standard. The Linux Foundation’s Model Openness Framework does just that by distinguishing three tiers of open AI systems (with a data sharing requirement as a key differentiating factor). The OSI has opted for a “binary”, non-tiered standard, arguing that it is necessary to make clear which systems are compliant and which are not. In doing so, the definition missed an opportunity to at least acknowledge the value of Open Data sharing – this could have been done by mentioning open sharing as an optional way of meeting the data transparency requirement.
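The contrast between the two approaches can be illustrated with a minimal sketch. The component flags, tier names, and checks below are our own simplifying assumptions, not the OSI’s or the Linux Foundation’s actual compliance processes; the point is only that a binary standard answers yes or no, while a tiered one lets open data distinguish the top tier.

```python
from dataclasses import dataclass

@dataclass
class AISystem:
    # Simplified flags for the components discussed above; a real
    # assessment would be far more granular.
    open_code: bool
    open_weights: bool
    data_information: bool
    open_data: bool

def binary_compliant(s: AISystem) -> bool:
    # A binary standard: the system either meets the definition or it
    # does not. (Simplified: data information suffices; open data is
    # not required, mirroring the OSAID's choice.)
    return s.open_code and s.open_weights and s.data_information

def openness_tier(s: AISystem) -> str:
    # A tiered standard, in the spirit of the Model Openness Framework,
    # where open data sharing is the key differentiating factor.
    # Tier names are invented for illustration.
    if s.open_code and s.open_weights and s.open_data:
        return "Tier 1: fully open, including training data"
    if s.open_code and s.open_weights and s.data_information:
        return "Tier 2: open components with data transparency"
    return "Tier 3: partially open at best"
```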
The issue could also be resolved by adopting a different term than “open source” to describe systems that comply with the current definition, reserving that term for fully open AI systems whose training data is also available under open licenses. This goes back to the – partially philosophical – debate on what constitutes the source of AI systems. The OSI has decided that training data should not be considered the source because it does not equate one-to-one with software source code. While this observation is correct, it should also be clear that the term “open source,” applied to AI, needs to be based on analogy rather than direct equivalence. Looking at the list of AI components and considering how models are trained on underlying data, training datasets are the component that can most plausibly be understood as the source of AI systems.
Unfortunately, the OSI decided on the name – the Open Source AI Definition – at the outset of its process. Notably, this brings the OSAID closely in line with the definition of open source AI systems used in the AI Act, which also does not require training data to be shared under an open license or made publicly available. Still, it would have been preferable for the OSI to choose a name that makes clear that the level of openness required to meet the definition corresponds to what many observers call open weights models – and to reserve the term “open source” for AI systems that include open data.
Unquestionably, the OSAID is an important step in defining the standard of openness in AI development. Its most important contribution is that it draws – in the form of the data information provisions – a clear line regarding the minimum training data transparency required of any AI system that can be considered open in a meaningful way.
This has already been validated by Meta – which claims that its Llama family of models is “open source” even though the models are released under non-OSI-compliant licenses and come without any information on the training data – publicly declaring that it does not agree with the definition. This illustrates that the OSAID already puts pressure on companies not to use the term “open source” to describe models that are partially open at best.
Having achieved this, the OSI and its allies will now face the challenge of defending this standard and turning it from a proposal made by a key standard-setting organization into a commonly accepted one.
At the same time, version 1.0 of the OSAID should not be the last word on how data sharing is understood in the context of developing AI systems. Hopefully, the definition will be iterated on as the OSI explores how the standard works in practice. The definition and the associated FAQ signal other issues that still need to be resolved, for example those related to legal means for sharing model weights or the applicability of copyleft mechanisms.
More importantly, the OSAID should be seen as just one position in a broader, ongoing debate on data sharing. Also this week, the DPGA announced an upcoming consultation on its standard for AI as a digital public good, creating an opportunity to iterate on a broader, shared position. This debate needs to bridge the positions of AI developers and those of other stakeholders – most notably, the stewards of the various collections that form the basis of AI training datasets.
The debate around the OSAID has clarified the contentious issues. The next step is to define a shared community norm – one that recognizes the value of fully open AI systems built on open datasets while supporting a gradient of other data sharing approaches that benefit open AI development. Hopefully, various organizations will engage in building this norm as they define their own positions.
Open Future, together with the OSI, organised a workshop on data sharing and open source AI in mid-September. A report from the workshop is forthcoming.