Open Source, AI and the Paradox of Open

In the grip of the paradox of open

This summer has seen a renewed interest in Open Source Artificial Intelligence. This interest has been fuelled by several developments: the release of Meta’s Llama 2 large language model under what Meta calls an open source license (a label rejected by the organization in charge of the open source definition), discussions about how the upcoming EU AI Act should deal with open source AI systems, and, finally, a strategic bet by France on open source development.

These developments provide the backdrop for the “Open (For Business): Big Tech, Concentrated Power, and the Political Economy of Open AI” paper published last month by David Gray Widder, Sarah West & Meredith Whittaker. In their paper, the authors argue that “even the most open of ‘open’ AI systems do not, on their own, ensure democratic access to or meaningful competition in AI, nor does openness alone solve the problem of oversight and scrutiny.”

The paper is an important contribution to the discussion about the potential, limitations, and regulatory implications of applying the open source development method to developing so-called AI systems. It focuses on whether open source can provide alternatives to the current approach to developing and deploying AI systems, which is dominated by Big Tech. Here, the authors’ critical take is very similar to observations that we have made in our work on the “paradox of open” from 2020 onwards. Widder, West, and Whittaker are correct in pointing out that:

‘openness’ often enables systemic exploitation of developers’ and creators’ labor while maintaining the infrastructural and ecosystem dominance of the largest firms. In the context of high levels of corporate concentration and gatekeeping over the ingredients necessary to build AI systems, ‘open’ AI as currently operationalized is primed for corporate capture.

The above mirrors the core observation from the paradox of open: “openness can both challenge and enable the concentration of power,” and the authors of the Open for Business paper do a very good job of detailing these dynamics at play in open source AI development. As Luis Villa phrases it in his review of the paper, “‘open’ is not a magic bullet to the heart of power.”

But that does not mean we should abandon openness as a value and an architectural principle when shaping the future of AI development and regulation. Regarding alternative ways of “doing tech,” openness still has much to offer. With regard to AI systems, we explore these advantages in our recent policy paper that formulates recommendations on how the upcoming EU AI Act should deal with open source AI development.

Creating conditions for Open Source AI to make an impact

However, if open alone cannot democratize AI, what else must happen? The paper’s authors identify several bottlenecks that prevent openness as an approach to developing AI from taking power away from big tech monopolies. In the remainder of this post, we will closely examine two bottlenecks: limited access to the computational power required to develop powerful models and challenges related to data required to build large-scale AI systems.

On the question of access to computational resources, the paper highlights that…

…the computational resources needed to build new AI models and use existing ones at scale, outside of privatized enterprise contexts and individual tinkering, are scarce, extremely expensive, and concentrated in the hands of a handful of corporations, who themselves benefit from economies of scale, the capacity to control the software that optimizes compute, and the ability to sell costly access to computational resources.

Similarly, access to data constitutes another bottleneck. Here, Widder, West, and Whittaker observe that …

… The preparation and curation of data used to train and calibrate leading large-scale AI models involves resource-intensive processes much more complicated than downloading an openly available dataset. The transparency and reusability of datasets like the Pile and CommonCrawl allow for better evaluation of model training and limitations. But beyond the cost and time required to create them in the first place, significant labor is involved in curating these before they’re used in training in order to enable better model performance.

Let’s take a look at both of these challenges in more detail.

Access to compute: digital public infrastructure

As the authors point out, one of the primary obstacles to open source AI development is the price of the required computational resources (large clusters of high-end GPUs) and their concentration in the hands of tech giants. As a result, a significant portion of what appears to be independent progress in AI development, and is widely labelled as open source, is in fact intricately linked to or dependent on the generosity of the industry and, as such, further entrenches existing power structures.

In essence, the open-source machine learning researchers who work outside of or without the support of big tech are compute-poor and find that they cannot afford the “freedom to run the code.” This problem casts a shadow over the field, demonstrating that to truly democratize AI, the disparity in access to computational resources must be addressed.

The reliance on infrastructure and services offered by commercial entities poses a widespread challenge that extends beyond AI development. We’ve pointed out before that public institutions have been held back from updating their public interest missions in response to the challenges and opportunities resulting from digital transformation and that there is a need for more investment into public digital infrastructures.

This logic makes even more sense in the context of AI development. If we want to reduce our dependence on big tech, we — in the European context, that means the EU — must invest in public infrastructures. In the context of AI, this means establishing a robust and publicly accessible computational foundation for open source AI research.

This insight is already gaining traction: The German association Large-scale Artificial Intelligence Open Network (LAION) has launched a petition calling on the European Union to establish a publicly funded and democratically governed research facility capable of building large-scale artificial intelligence models. Conversations on this topic have also flourished in France, where in June, President Macron announced new funding for an open “digital commons” for French-made generative AI projects. And more recently, Ursula von der Leyen expressed her support for digital public infrastructures that are trusted, interoperable, and open to all.

All of these efforts indicate that there is a clear path towards addressing the computational resource bottleneck. This recognition should translate into concrete investments that address the disparity in access to computational resources with the aim of fostering independent open source AI development.

Access to data: digital commons

Access to training data constitutes another important bottleneck for open source AI development. AI model developers rely on massive amounts of content created and often curated by others. While there is no commonly accepted definition of open source AI at the moment, many observers assume that for an AI system to be meaningfully open, the training data used in its creation must be publicly available in a form that allows re-use.

Transparency about data is a precondition to entertaining any further consideration of whether AI is “open” or not. But, just like openness, transparency alone will not ensure democratic access to or meaningful competition in AI, nor will it fix bias or prevent the dynamics of corporate capture or influence.

As Widder, West, and Whittaker pointed out, the public availability of training data is currently the exception rather than the rule, and commercial AI models of all sorts are increasingly opaque about the training data used. The relative scarcity of openly available training data sets makes it hard for open source AI developers to compete against commercial competitors who often have access to vast amounts of proprietary data in addition to data scraped from the public internet.

As we have argued before, such data scraping practices are problematic. To prevent exploitation and value extraction, transparency requirements must be part of more general and ambitious governance mechanisms for AI training data. These governance mechanisms should achieve at least two objectives:

First, they must ensure that people who share information and artistic expression online have tools for opting that data out of AI use.
Second, when open or publicly available data is used for AI, they must guarantee a fair way of “giving back” to the creators and communities.

It is this second objective that has the potential to level the playing field in favor of independent open source AI development, as it would increase the amount of training data that is publicly available.

While the European Union’s legal frameworks provide some assurance that personal information and artistic creations are protected from use in AI training, they do not yet offer a mechanism to ensure that the use of open or publicly accessible data in AI applications is not extractive or exploitative and that it contributes back to the digital commons.

Our work has shown that the concept of commons has value in addressing this challenge. Compared to fully open data sources, data commons introduce more robust safeguards and control mechanisms, recognizing the effort and value associated with their creation. When data commons are used for AI training, that recognition should extend to the significant labor involved in preprocessing the data to make it suitable for AI applications.

Treating data available for AI as commons offers a framework to balance openness and sharing with the need to ensure data’s sustainability and rules for protecting the interests of its creators. It also has the potential to address the data disadvantage faced by independent open-source AI developers, highlighted by Widder, West, and Whittaker. The bottom line is that technologies that build on the digital commons must be kept from depleting or jeopardizing them. Instead, these technologies need to be governed to nourish the commons.

Alternatives must still be open

We agree with Widder, West, and Whittaker that openness alone will not democratize AI. However, it is clear to us that any alternative to current Big Tech-driven AI must be, among other things, open. Widder, West, and Whittaker identify important bottlenecks that prevent open source AI efforts from reaching their potential. Their contribution to the discussion is significant and underscores the importance of investing in public digital infrastructure and of implementing regulations that can strengthen the digital commons.