AI Speaks Polish

How open models drive generative AI development in smaller markets
November 28, 2024

Progress in the development of generative artificial intelligence (AI) technologies is increasingly met with concern over the concentration of power and the uneven allocation of AI’s benefits. Commercial foundation models are trained in a paradigm that assumes constant scaling of technologies, which drives both market competition and technological power. It might therefore seem that the economics of creating AI technologies preclude the emergence of alternatives, whether publicly funded or created by smaller commercial players.

However, a steady stream of alternative solutions has been developed in parallel to the dominant ones, benefitting from the same openly shared research and open source technologies. In 2022, the year OpenAI released the ChatGPT service, the open source BLOOM model, created by a community of researchers backed by Hugging Face, was also released. The growth of alternative solutions is increasingly seen as one of the ways, alongside regulation, to curb concentrations of power in AI, and thus to democratize AI development.

Today, the new paradigm of creating small language models and the availability of open foundation models make it possible to efficiently create new language models, particularly those that address language gaps in generative AI development. This report presents a case study of the Polish ecosystem, in which open language models are being developed as Digital Commons. These cases of model development are examples of a public AI approach: building AI infrastructure for the common good, with a public orientation in mind.

This report focuses on two such initiatives based in Poland. SpeakLeash is a community that has been building a Polish-language dataset and, in April 2024, used it to build Bielik, a Polish small language model. PLLuM (Polish Large Language Model) is a consortium of public research institutions that also aims to create a language model tailored to the specifics of the Polish language. We also show how a broader ecosystem has emerged around these initiatives.

The report is based on interviews with the creators of Polish models. Drawing on these interviews, we analysed the model development processes and the challenges the creators have to solve. We also draw conclusions from their achievements that can help support such initiatives in the future.

The goal of the report is to raise awareness that open language models are being developed to reduce language gaps in AI development and to provide alternative technologies. Lessons from the case studies can also help in formulating public policies to support the development of such alternatives. Our key findings include:

A Polish version of the report is available on the website of Centrum Cyfrowe, our partner in writing this report.


Alek Tarkowski
with: Kuba Piwowar (Centrum Cyfrowe), Michał Owczarek (SWPS University)