AI and the Commons: building AI datasets for the future

Could approaching datasets as archives improve the quality of generative AI?
February 16, 2024

At our first 2024 AI and the Commons Community Call, we were joined by Eryk Salvaggio, an interdisciplinary researcher, lecturer, and artist who works with digital media and AI.

AI and Wikimedia Commons

Eryk prepared a short talk, a spin-off of a longer talk he gave at the Wikipedia NYC “WikiDay 2024” a few days prior. You can watch this 40-minute talk, archived by the Internet Society, here.

The talk highlighted lessons that the field of AI can learn from the success of the Wikimedia project, some of which we have also blogged about before.

Datasets vs. archives

One of Eryk's main points was that if the knowledge fed into generative AI models were created through a Wikipedia-like submission and governance process, it would simply be better:

While we all agree that archives like Wikipedia are faulty and biased, there are at least humans having conversations about them and mediating this data.

If we approached AI data repositories as archives, there would be more transparency, agency, and responsibility assigned to those who contribute to the sets as well as those maintaining and curating them. Eryk argued that we could then “add meaning and context” to pure data and, thus, make these models more complex and useful.

Who should take up the role of AI Stewards?

During the call, we discussed why some institutions, such as libraries and newspapers, are purposefully blocking data-scraping bots in an effort to avoid exploitation by commercial AI models. We also talked about ways in which the missing context could be added to digital culture collections, such as adding metadata to existing collections or exploring a value-driven design approach.

Who should be doing this work? Should it be a volunteer-driven, international foundation, as in the case of Wikimedia, or state-funded government institutions?

As we discussed in an earlier conversation about the paper “Open (For Business): Big Tech, Concentrated Power, and the Political Economy of Open AI,” the costs of developing generative AI models are so high that significant, public-oriented funding is needed to make sure AI can be developed and curated in societally beneficial ways. We wrote about it in our recent publication on EU Policies for the Digital Commons.

The discussion about governing and curating AI is most definitely to be continued, and we hope to explore other aspects of it in the upcoming AI and the Commons calls.

AI and the Commons community calls are invite-only conversations. If you’d like to join them, email

Alicja Peszkowska