How Wikipedia can shape the future of AI

Part one of two
May 4, 2023

This year, Open Future is exploring the intersection of openness and AI development. We are looking into establishing transparency as a principle that increases the social value of these systems and democratizes these technologies. The following piece is the first part of a case study on how Wikipedia is positioned to address the challenges of open AI development. It spells out the general argument, followed by more specific suggestions on what a WikiAI mission could look like.

As we have explored issues around AI and openness, the need for a commons-based approach to AI development has become increasingly apparent. Today, this work is being driven by AI research companies (such as Hugging Face, EleutherAI, or Stability.AI) that identify with openness as a value and propose a broader vision for democratizing machine learning technologies.

However, they are not the established stewards and creators of free knowledge or digital commons ecosystems. At the same time, the deployment of machine learning systems raises crucial questions for openness: open content is being used to train AI models, and questions arise about sharing AI technologies and their outputs.

So what should be the proper reaction of open activists and organizations? What would a more proactive, robust agenda for openness and AI look like?

Some work on it is already being done. For example, Creative Commons has been exploring how copyright law and tools apply to the generative AI space for many years. Mozilla has recently announced the launch of a new startup tasked with “building a trusted, independent, and open-source AI ecosystem.”

Then, there is, of course, Wikipedia and the Wikimedia movement. As the steward of one of the largest repositories of free knowledge and of the ecosystem (both social and technological) that supports it, the movement is uniquely positioned to address some of these issues. And Wikipedia is already deeply embedded in emergent AI systems as a critical component of many AI training datasets.

The Case for WikiAI

In 2001, a group of activists and knowledge workers decided to take on a three-hundred-year-old knowledge industry and create a new encyclopedia that was not only better but also freely shared. Today, Wikimedians — and open advocates and producers in general — face a similar challenge. In this case, the incumbents are only half as old as Wikipedia.

Machine learning systems like the GPT models are being adopted this year at an unprecedented rate. They are poised to become the next general-purpose technology, like the Internet and the Web. The challenge for the open movement is already clear: to build an alternative to corporate, closed machine learning systems, and to protect the commons from exploitation by those systems.

What is at stake is the design of a new content production and distribution ecosystem that will shape the entire digital environment and our societies. As has been the case in the past, a strong actor committed to the values of openness and free knowledge has a chance to tip the balance away from closed ecosystems controlled by commercial monopolists.

This is not just about saving the soul of a new technology. Machine-generated content production will significantly impact the free knowledge ecosystem and the commons-based peer production model. For this reason, the Wikimedia organizations and their partners should launch a WikiAI mission.

An opportunity to build a new approach to open

Is the goal to build and deploy “open AI”? Not necessarily, and not just because “OpenAI” is the trademarked brand of a company that wants to dominate this ecosystem.

The current moment is an opportunity to ask ourselves: what does openness, or a commitment to free knowledge, mean today? Should we build an ecosystem as open as the one the Wikimedians created twenty years ago? Or are there fundamental differences?

We need a new openness. When thinking about “open,” we need to pay more attention to the issue of power and its imbalances. The traditional activist view of openness is that it challenges concentrations of power — but we know it can also serve them. So open advocates should consider the issue of democratization (or social justice). Traditionally, greater freedom or equality has been seen as a natural outcome of open ecosystems. But it turns out not to be that simple: freedom and equality need to be introduced by design and actively managed.

The goal is to address what Anna Mazgal calls the unintended consequences of open: a trend in which public, user-generated value is locked into proprietary products and the resulting profits are privatized. As a remedy, Mazgal proposes a “permaculture of open,” where sharing back is as important as contributing to free knowledge. In other words, Wikimedia’s contribution to a free and healthy Internet is not limited to its knowledge output — it also counts as a civic, democratic space.

Here, Wikimedians can benefit from the new Wikimedia Movement Strategy, developed and adopted between 2017 and 2020. The strategy lays out an approach that gives as much attention to equity and democracy as it does to freeing knowledge. It is a prime example of an effort by open activists to address the paradox of open.

Why should open activists care about AI?

One could argue that opening up AI systems should not be the goal of free knowledge advocates. The argument can be made by pointing to the hype that fuels AI development today. We may be witnessing another chapter in a multi-episode saga of the reckless deployment of technologies on a planetary scale.

The launch of the chatbot-powered Bing search engine was a fiasco (from a responsible development perspective): its unhinged chatbot was an example of why more robust moderation of AI is needed (as if Microsoft hadn’t learned the lesson in 2016, when it deployed its first chatbot, Tay). And Cory Doctorow recently described Google’s AI position as driven by deep fear, which is not a good position to lead from.

The downstream deployment of these centralized technologies, as they appear in an ever-widening range of online services, also shows signs of AI being peddled as snake oil. There is an apparent rush to “put AI inside,” one digital platform at a time.

For these reasons, there is a strong position in the open movement that is averse to such hype. Activists are wary of technology development that is framed as helping humanity when, in fact, it is often driven by commercial interests. This was evident in the case of Web3, with open advocates wary even of the progressive conversations taking place in the space. Similar criticism of generative AI development can be expected.

There is also a sense that open ecosystems can evolve at their own slower pace and in ways that are more thoughtful and less hyped — even at the cost of significant tradeoffs. The Mastodon network, with a strong culture that makes it very different from its commercial counterparts, is currently the best example.

The open movement should be critical of the AI hype. But it would be a mistake not to move into this space now. The reason is that the world needs access to open and democratic AI systems that function as digital public goods. One reason for this, which I have signaled before, is that we need alternatives to closed corporate systems like ChatGPT. Secondly, the social value generated by machine learning systems will affect many areas of life — something that tends to be left out of many of the debates that focus on current challenges. Adequately governed, AI systems will support peer production and knowledge sharing.

A challenge…

The Wikimedia movement, in particular, should address the issue of AI in free knowledge ecosystems, not least because it is the most essential and prominent actor managing these ecosystems. As a large body of freely available content, Wikipedia is already a part of these systems — a core component of the datasets on which many of the major language models are built. For example, recent research by the Allen Institute and the Washington Post shows that Wikipedia is the second largest content source in C4, a training dataset built by Google by scraping 15 million web pages. It is also one of the primary data sources for the Pile, an open-source language modeling dataset. And according to researchers, it is particularly relevant for improving these models.

This situation can be viewed in two ways. One way is to see it as an existential risk to Wikipedia and the broader project of freeing knowledge that has been underway for the last two decades. Search engines have already disintermediated Wikipedia: Google uses Wikipedia content in the information boxes it displays, and as a result, many people no longer click through to Wikipedia itself. This is fine from a traditional free knowledge perspective — content should travel freely. But it undermines the sustainability of Wikipedia, which relies on people visiting the site for financial support and engagement with the encyclopedia.

Developments in generative AI may exacerbate this risk. We could be facing a world in which AI-model interfaces are the new gatekeepers of knowledge, and people are prompting chatbots instead of reading encyclopedias. And the Share Alike clause, explicitly designed to address this threat of enclosure (or exploitation), no longer seems to be a viable defense. Due to the specificity and complexity of AI development lifecycles, content reuse often does not fit the concepts on which this clause depends, such as copying content and creating derivatives.

By fully opening up its resources and paying close attention to the way they are structured and documented, Wikipedia has set itself up for replacement by generative AI. This is especially true if the very models trained on Wikipedia begin to create content for the encyclopedia, quickly pushing human editors out of the loop. The knowledge could still be accessible (there are good reasons and strong advocates for keeping AI results in the public domain), but the permaculture model Mazgal proposes would no longer exist.

…and an opening

However, this situation can also be seen as an opportunity. My source of optimism comes from observing the Wikimedia Enterprise initiative, which launched in 2021. It offers paid, enterprise-grade APIs suitable for building commercial solutions on top of Wikipedia content. And Google is one of its first customers (the second is the Internet Archive). The service is voluntary, but it provides a model that addresses the challenge of disintermediation and offers a solution that increases Wikimedia’s sustainability. As such, it should be seen as a significant innovation regarding the stewardship of free knowledge.

From a 2023 perspective, this initiative should be seen as an essential development in ensuring fair value chains in the field of AI development. It addresses one of the main concerns: when data is used to train AI systems, those who create the data are not compensated for their work.

Consequently, the Wikimedia Enterprise program should be regarded as the cornerstone of WikiAI. It is an initiative that shows what a proactive, optimistic approach to AI systems and free knowledge can look like. In the second part of this post, I will outline this approach in more detail.

The second part of this case study can be found here.

I’m grateful to Anna Mazgal and Paul Keller for their feedback on this piece. Anna’s “permaculture of open” concept is an excellent way of framing the new open. Paul, in turn, has provided valuable ideas about how the Share Alike clause, fundamental to Wikipedia’s model of free knowledge, breaks down in AI development life cycles.

Alek Tarkowski