Common Corpus: building AI as Commons

May 2, 2024

At our April AI and the Commons community call, we heard from Pierre-Carl Langlais, a digital humanities researcher, Wikipedian, and a passionate advocate for open science. Pierre-Carl is also the co-founder of the French AI startup Pleias and the coordinator of Common Corpus, a public domain dataset for training LLMs. During the call, he talked about ways in which generative AI models can be designed and built as a Commons.

This is the approach adopted by his company, Pleias, guided by a mission to build open-science LLMs trained on open sources of data and shared under permissive licenses. AI as the Commons also means that the data sets are ethically sourced and culturally diverse. In his talk, Pierre-Carl walked us through ways of building such data sets and using them to create pre-trained models. In the last part of his talk, he described fine-tuning as a commons-based practice in which communities adapt models to their needs.

Pierre-Carl opened his talk by saying that AI does not constitute a form of autonomous intelligence – its capacities stem directly from the training corpus. Thanks to the extension of current models’ context windows, large language models have de facto become cultural models that are able to deal with a variety of formats and languages. That is why the question of how data is sourced and curated is so crucial.

Pierre-Carl used the example of GPT-3 to illustrate the precariousness of existing data sets. Content available online is repurposed as “free,” web archives meant to preserve content have become sources for training commercial models, and some of the content used is most probably pirated. These issues are not just about intellectual property – “data issues are cultural issues”. One example given by Pierre-Carl was the data annotation work done by outsourced labour in the majority world. These exploitative practices are well known, but only now is it becoming clear that they are also a potential source of linguistic bias, as models adopt the conversational style of the people performing the refinement work.

We agree with Pierre-Carl that, in their current shape, “LLMs are a corrosive force for the Digital Commons.” Without safeguards and well-established, public-purpose-driven practices, technology development can disproportionately serve the interests of parties with more economic power and resources. In our recent white paper on commons-based data set governance, we note that

In the context of AI development, this is particularly true for companies that did best in the previous wave of digital innovation and enjoy significant economies of scale, to the detriment of parties that contribute to the commons but lack such power.

Pierre-Carl also made a key point about the ongoing push for counter-regulation, with the New York Times suing Microsoft and OpenAI for infringing the publisher’s copyrights. This will lead AI companies to sign licensing deals with rightsholders. The first examples of such deals already exist: between Reddit and Google, and between Le Monde and OpenAI. A licensing-based approach to model training creates one more risk of gatekeeping, as only the largest companies will be able to afford the licensing costs. Launched a bit over a month ago, Common Corpus is an attempt to address these challenges by presenting a new way of contributing to the development of AI as Commons.

As the largest open-content training data set for language models to date, Common Corpus is built with open data, including administrative data as well as cultural and open-science resources – such as CC-licensed YouTube videos, 21 million digitized newspapers, and millions of books, among others. With 180 billion words, it is currently the largest such data set in English, but it is also multilingual and leads among open data sets in French (110 billion words), German (30 billion words), Spanish, Dutch, and Italian. Developing Common Corpus was an international effort involving a spectrum of stakeholders, from the French Ministry of Culture to digital heritage researchers and the open-science LLM community, including companies such as HuggingFace, Occiglot, Eleuther, and Nomic AI. The collaborative effort behind building the data set reflects a vision of fostering a culture of openness and accessibility in AI research. Releasing Common Corpus is an attempt at democratizing access to large, high-quality data sets that can be used for LLM training. Common Corpus aims to become a key component of a wider pretraining Commons ecosystem, alongside efforts such as the “licensed” Pile currently being prepared by Eleuther.
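For readers who want to explore the data set directly, the sketch below shows how a Common Corpus collection could be inspected with the Hugging Face `datasets` library. The repository identifier `PleIAs/common_corpus` is an assumption used for illustration; the exact names of the published collections can be found on the PleIAs organization page on the Hugging Face Hub.

```python
# A minimal sketch: streaming a few records from a Common Corpus collection.
# "PleIAs/common_corpus" is an assumed identifier; check the PleIAs page on
# the Hugging Face Hub for the actual names of the published collections.
from datasets import load_dataset

# Stream instead of downloading the full corpus (hundreds of billions of words).
corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Peek at the first few records to see the available fields (text, source metadata, etc.).
for i, record in enumerate(corpus):
    print(record)
    if i >= 2:
        break
```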

During the call, Pierre-Carl argued that the quality of AI systems rests not just on building ethically sourced, culturally diverse data corpora but also on the practice of fine-tuning: adjusting a pre-trained language model to a specific task or domain. Fine-tuning is an important element of a vision for AI as commons, as it allows general, pre-trained models to be adjusted by communities for their own needs.

One example here is the crowdsourced annotation work carried out with Argilla in Hugging Face Spaces; another is the creation of Albert, a conversational agent (currently in beta) that uses official French data sources to answer administrative questions and was fine-tuned with the help of French public workers. Both of these practices could serve as blueprints for creating more community-focused AI solutions. Fine-tuning also allows LLMs to be adapted to other languages, a method that is particularly promising for smaller languages. Pleias is conducting such work in Ukraine and Senegal.
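To make the idea of community fine-tuning concrete, the sketch below shows the general shape of such a workflow using the Hugging Face Trainer. It is only an illustration under stated assumptions: `gpt2` stands in for an openly licensed base model, and the two inline question-answer strings stand in for a domain corpus curated by a community (for example, administrative Q&A written by public workers, as in the Albert project).

```python
# A minimal fine-tuning sketch with Hugging Face Transformers.
# Assumptions: "gpt2" is a placeholder base model, and the inline examples
# are hypothetical stand-ins for a community-curated domain corpus.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical domain examples, e.g. administrative questions and answers.
examples = Dataset.from_dict({
    "text": [
        "Q: How do I renew my ID card? A: You can apply online or at your local office.",
        "Q: Which documents are required? A: Proof of address and a recent photograph.",
    ]
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = examples.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="community-finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

The same structure scales up: a community can swap in a larger openly licensed model, a data set built through crowdsourced annotation, and parameter-efficient methods such as LoRA when compute is limited.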

In essence, Common Corpus embodies a fundamental aspiration: to democratize LLM innovation by supporting strong Data Commons and ensuring a more equitable and diverse future for model training and deployment, so that AI technologies can function as Commons. Both what goes into training a model and how it is fine-tuned seem to be almost political decisions at this point, said Langlais:

It is all about what ideology and culture we are going to reproduce with the help of this new cultural technology.

Alicja Peszkowska
Alek Tarkowski