Stewarding the sum of all knowledge in the age of AI

Part two of two
Opinion
July 7, 2023

In my first opinion piece on Wikipedia’s role in the emerging field of generative AI, I outlined a potential WikiAI mission: stewards of Wikipedia, and of other free knowledge or digital commons ecosystems, urgently need to develop their approach to machine learning and artificial intelligence technologies. The challenge is not just to protect the commons from exploitation. The goal also needs to be the development of approaches that support the commons in a new technological context, one that changes how culture and knowledge are produced, shared, and used.

I focus on Wikipedia (and the broader Wikimedia Movement) as the most significant commons-based ecosystem and a crucial content source for training generative AI systems. But the arguments that I am presenting also apply to other commons-based platforms: Open Access repositories in the research sector, Open Data repositories managed by governments, and heritage collections — both those of public institutions (like those aggregated by Europeana) and civic projects like the Internet Archive or Flickr.org.

The Wikimedia movement recently took steps toward dealing with generative AI. At the beginning of the year, a group of Wikimedians created a policy proposal for dealing with large language models, focused on their use by editors. And the draft 2023/2024 work plan of the Wikimedia Foundation includes exploring how Wikipedia might be used when chatbots serve as intermediaries to its content.

Still, we need a more holistic approach that considers how machine learning technologies impact Wikimedia: changes to editing, the disintermediation of users, and the governance of free knowledge as a resource used in AI training. These changes call for an overall strategy that balances the need to protect the organization from negative impacts and harms with the need to deploy new technologies in productive ways that help build the digital commons.

Open and responsible: paths toward opening the AI stack

Such an overall strategy for a commons-based approach to generative AI technologies can follow one of two paths. The choice is illustrated by the stances taken by organizations and companies championing open source AI: Eleuther.AI, Stability.AI, and HuggingFace.

Until last year, there was a shared sense that machine learning systems were destined to be developed in closed approaches that concentrate power in the hands of large corporations. OpenAI, the leading company in this space, admits to pivoting from open sharing to gated access.

Already in 2020, Eleuther.AI was researching and building AI technologies with open principles in mind, believing that the ability to study foundation AI models should not be restricted to a handful of companies. Eleuther.AI developed, or helped develop, key openly available AI models, as well as the Pile, a dataset that has played a crucial role in LLM development. Yet Eleuther.AI’s research has remained little known to the public.

So it was only in 2022 that open source AI development became more broadly discussed, triggered by the near-simultaneous release of two new large generative models: Stable Diffusion, a text-to-image model created by Stability.AI, and BLOOM, a large language model peer-produced through a project led by HuggingFace.

There are apparent ideological differences between the two companies behind these models. The first, Stability.AI, combines a commitment to open source with a laissez-faire ethos focused on innovation[1]. Admittedly, the company has taken some steps that can be seen as more nuanced regarding the principle of responsibility, for example by exploring opt-out mechanisms for training data.

The second company, HuggingFace, aims to “democratize good machine learning, one commit at a time.” HuggingFace runs a platform that shares open models, datasets, and libraries, and actively develops some of these solutions itself, a relatively traditional take on open sourcing a technological stack. Beyond that, it works to “democratize” the space: it pursues technological innovations that make ML solutions more accessible, and it runs projects like BigScience, which aimed not just to open source a model but to set a standard for responsible development and participatory governance.

The approach taken by HuggingFace feels very close to the current Wikimedia Movement Strategy, with its twin commitment to free knowledge as a service and knowledge equity, only encoded into the machine learning stack. “Democratizing AI” is a solid principle on which to build the WikiAI mission. Fulfilling that mission, however, would require Wikimedians to take additional steps to properly govern both the use of AI in Wikimedia projects and the use of Wikimedia resources in AI development.

Without these steps, Wikimedia might follow the other path, where a commitment to open source principles does not necessarily secure broader societal goals and fails to build a “permaculture of open.”

Governance and partnerships as principles of WikiAI

As a starting point, governance and partnerships need to be established as principles for WikiAI development. Without a strong commitment to these principles, the accepted mode of commons-based peer-production on Wikipedia will prove insufficient to deal with these challenges.

The value of both governance and partnerships has been intensely debated in the context of the current Wikimedia Movement strategy and its implementation. However, I would argue that both principles remain aspirational. For each of them, the movement could do more to truly bring them to life, both through better mechanisms for participatory governance and by expanding the movement through partnerships. Managing the use of machine learning technologies and their impact creates an opportunity to strengthen both.

Due to how they use (or even exploit) the commons and how they contribute to it with synthetic content, machine learning systems create major challenges for the Wikimedia Movement. They require participatory governance, and they provide the best opportunity to explore new ways of governing Wikimedia projects. Participatory governance of AI is a hot topic, with initiatives launched by the biggest companies in this space, OpenAI and Meta. Democracy Next has also proposed a civic initiative on the issue, based on the citizen assembly methodology. And a recent study by Wikimedia Deutschland and the Platoniq foundation explored participatory governance in Wikimedia projects. A “Wikimedian assembly on AI” could bring real value to the movement.

Regarding partnerships, Wikimedia should build relations with organizations like Eleuther.AI and companies like HuggingFace, so that the expertise of responsible AI developers can be combined with that of the stewards of the digital commons.

Thinking more bravely, all this creates room to define a mission-based approach for the broad open movement, and to test whether the Wikimedia Foundation can see itself as a steward working in solidarity with the Wikimedia Movement and broader open ecosystems.

The growth of AI systems, due to the cross-cutting character of the issue, provides a vast opportunity to engage in such collaboration, which should include not just prominent players in the movement like Creative Commons and Mozilla, but also many smaller actors. The planetary scale of this effort and the need to consider content and languages from around the world make this a mission well suited to inspire such global cooperation.

Last but not least, the partnership principle could also address something more fundamental — the overall approach to AI development should be based on a partnership between human Wikimedians and machine actors.

Elements of WikiAI

To sum it up, the development of WikiAI – a commons-based, peer-produced AI stack seen as an integral part of the Wikimedia effort – should cover the following core elements:

Commons-based datasets

For years, Wikimedia has been a source of data for other projects. Search engines like Google have been using Wikipedia content in infoboxes (fundamentally shifting, in the process, the information flows in which Wikipedia is situated). Photographs from Wikipedia have been used for face recognition training for almost a decade.

Recently, content from Wikimedia projects has also been used to train most of the significant large language models, and Wikipedia content is currently crucial for improving those models’ quality. In sum, the knowledge commons built by the Wikimedia Movement, packaged into training datasets, have proven to be a valuable resource for AI development. As part of the WikiAI effort, we should develop governance mechanisms that make the sourcing of such datasets more sustainable. In other words, there is a need to go beyond Wikidata and to design and build the Wikidataset.
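
To make this concrete, here is a minimal sketch of how Wikipedia content already circulates as a packaged training dataset. The dataset name and snapshot identifier reflect what the Hugging Face hub has hosted, but they are illustrative and may change over time.

```python
# A minimal sketch of Wikipedia as a packaged ML training dataset.
# The dataset name and snapshot are illustrative; actual identifiers
# on the Hugging Face hub may differ over time.
from datasets import load_dataset

# Stream a preprocessed dump of English Wikipedia instead of
# downloading it in full.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en",
                    split="train", streaming=True)

# Each record pairs article text with minimal provenance metadata.
sample = next(iter(wiki))
print(sample["title"], sample["url"])
print(sample["text"][:200])
```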

The Wikimedia Enterprise project could be part of such governance: a voluntary payment model that secures the sustainability of the commons at scale. Getting major AI companies to join the program would be a significant step forward — and could provide funding for the WikiAI endeavor.

Yet more has to be done than ensuring the sustainable sourcing of wiki content. We need commons-based dataset governance: datasets that are shared openly but that also address issues such as bias avoidance, sustainable and fair data collection, and participatory governance. Wikimedia is well positioned to build such datasets and to explore issues around their governance. We also need to explore governance mechanisms beyond those tied to copyright licensing: new data intermediaries, such as data cooperatives and data trusts, could serve this role.

Augmented peer production

The use of generative AI for wiki-content generation is already a reality. We should therefore design new modes of content production that provide guidelines and rules for such contributions, while avoiding two negative scenarios along the way.

The first is that of full automation. For example, automatic translation will soon enable articles to be machine-generated for many smaller language versions of Wikipedia, filling the content gap. Automation could be a significant source of free knowledge, but one that lacks the human peer-production aspect that defines the permaculture-based approach I am arguing for. There are also well-defined risks of bias and even disinformation related to automatic content generation, which would affect Wikimedia as well. This all leads to an important conclusion: the commons must be, at least to some extent, created by humans.

The other negative scenario is that of a conservative approach that shuns all automation. Today, some of the wiki work with data is extremely tedious, and Wikimedians are contributing by solving the wiki equivalent of captchas: for example, reviewing image descriptions for Wikidata items. Tasks like this should be automated — in line with a long tradition of bot-building in Wikimedia.
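
As an illustration, here is a minimal sketch of the kind of routine curation task a bot could take over, written with the pywikibot library. The item ID and the suggested description are placeholders, not a real editing workflow.

```python
# A minimal sketch of a curation bot in the Wikimedia bot-building
# tradition, using the pywikibot library. The item ID and the
# suggested description are placeholders, not a real edit workflow.
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

# Fetch a Wikidata item and check for a missing English description.
item = pywikibot.ItemPage(repo, "Q42")
item.get()

if "en" not in item.descriptions:
    # In practice the text would come from a human- or model-assisted
    # review queue; here it is a hard-coded placeholder.
    item.editDescriptions(
        {"en": "placeholder description"},
        summary="Bot: adding a missing English description",
    )
```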

Fortunately, these challenges were already explored in 2019 by the Wikimedia Product Team, which published a report on “Augmentation”, referring to a concept used by AI researchers to describe an alternative to the full automation of human work. The report charts a middle path by signalling the “need to continuously recognize and embrace augmentation as a major way to contribute to the wikis” in three areas: content generation, content curation, and governance/community conduct.

The rise of generative AI will lead to increased exploration of both the automation of work and ways of enabling, or even protecting, human labor. These trends will particularly affect peer production environments, as open ecosystems lack barriers to deploying automated methods. Conversely, good practices developed by Wikimedians can set an example for other knowledge workers (and Wikimedians should, in turn, carefully observe other platforms as they establish new approaches and rules).

This work should start with a seemingly simple issue: whether and how to distinguish human-generated from machine-generated content through some system of content marking. Such a distinction between genuine and synthetic content should be at the heart of augmented wiki work, with appropriate rules defined for the two types of content and for the different shades of their combinations.
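
To illustrate what such marking might record, here is a minimal sketch of per-revision provenance metadata. The categories and field names are hypothetical; no such MediaWiki feature exists today.

```python
# A minimal sketch of per-revision provenance marking. The categories
# and field names are hypothetical; MediaWiki has no such feature today.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Provenance(Enum):
    HUMAN = "human"
    MACHINE = "machine"
    MIXED = "human-edited-machine"  # one of the "shades" in between

@dataclass
class RevisionMark:
    revision_id: int
    provenance: Provenance
    model: Optional[str] = None  # which model generated the text, if any

# Example: a revision drafted by a model and then edited by a human.
mark = RevisionMark(revision_id=123456789,
                    provenance=Provenance.MIXED,
                    model="some-open-model")
```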

Wiki as a model and as a chatbot

Who will our children ask for information? This is the fundamental question that Wikimedians need to ask themselves. The shift that occurred when commercial search engines began reusing Wikimedia content signaled a significant change in Wikimedia’s ecosystem (one that the Wikimedia Foundation acknowledges in its review of external trends and its current work plan).

With AI agents, this trend will be exacerbated. They will further disintermediate Wikimedia, as commons-based data and content are fed into, and obscured by, the black boxes that are AI models, chatbots, and agents. This points to the need to develop WikiChat: an AI-based, conversational interface to Wikimedia.

The goal would be to build on the foundations of the Wikidataset and to create (ideally in partnership) Wikimedia’s own generative model. One that is open, but also governed democratically and designed responsibly. And one that gives Wikimedians full knowledge of, and control over, the whole system as a new interface to Wikimedia content. Once again, Wikimedia is positioned to define a standard for transparency, traceability, attribution, bias mitigation, and output licensing.
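
As a sketch of what such an interface could look like, the following retrieval-grounded flow uses the public MediaWiki search API, which is real; generate() stands in for a hypothetical, openly governed language model and is not an existing function.

```python
# A minimal sketch of a retrieval-grounded "WikiChat". The MediaWiki
# search API is real; generate() is a placeholder for a hypothetical,
# openly governed language model.
import requests

API = "https://en.wikipedia.org/w/api.php"

def search_wikipedia(query: str, limit: int = 3) -> list:
    """Retrieve candidate articles via the public MediaWiki search API."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return requests.get(API, params=params).json()["query"]["search"]

def answer(question: str) -> str:
    hits = search_wikipedia(question)
    # Grounding answers in retrieved articles keeps them traceable and
    # attributable to the Wikipedia pages they draw on.
    context = "\n".join(f"[{h['title']}] {h['snippet']}" for h in hits)
    prompt = f"Answer using only this context:\n{context}\n\nQ: {question}\nA:"
    return generate(prompt)  # hypothetical model call, not a real API
```

Grounding the model in retrieved articles, rather than relying on what it memorized during training, is what would let a WikiChat meet the transparency and attribution standard described above.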

Stewarding free knowledge for a new Web

At this time, it is unclear what the future of knowledge exploration will be. What role will encyclopedias, search engines, and chat agents play? Will new forms emerge? And will any of them “die”? James Vincent recently argued that AI is “killing the old Web,” but it is still unclear what the new Web will look like. The WikiAI mission would allow Wikimedia, and its partners, to shape this emerging ecosystem.

Wikipedia’s heavily text-based design has worked well. For the last two decades, it has made wiki content machine-readable and, therefore, very useful for the current phase of AI development.

Wikimedia has not, for example, developed an audiovisual format that would align with the current trends of social network use — there are no “Wikimedia shorts.” (The recently announced sound logo does signal an understanding that Wikimedia content is accessed in new ways, in this case through voice assistants).

While I am not arguing for replacing the current Wikipedia with a chat input box, I believe a chat-based sandbox should be built as soon as possible to actively explore possible scenarios. If AI-based services are the next step in the Web’s development, it is crucial that a significant civic AI platform be among them. And yes, we can imagine a future where Wikipedia is a virtual agent, talking both with its creators and its users.

Alek Tarkowski

Footnotes

  1. Elettra Bietti argues that a laissez-faire approach is part of the genealogy of open initiatives: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3859487.