Common Corpus public-domain data set released

A group of AI researchers coordinated by the French start-up Pleias wants to challenge the belief that you need copyrighted materials to train an LLM that competes with the models developed by leading AI companies. Yesterday, they released what has been dubbed the largest open AI training data set consisting entirely of public-domain texts. The collection is called “Common Corpus” and is available on Hugging Face for download. The resource is multilingual – besides English, it includes the largest open collections in French, German, Spanish, Dutch, and Italian, as well as collections for other languages.

Training data is a key resource for developing AI systems. Until very recently, it was commonly believed that LLMs, including those behind popular services such as ChatGPT or Bard, could not be trained without relying on copyrighted content. If this is the case, access to high-quality data may continue to be a significant barrier for independent AI developers seeking to compete in the LLM market.

Datasets consisting only of public-domain texts have significant limitations. The most important is that they lack contemporary information, because they consist of historical sources or older publications whose copyrights have already expired. It remains to be seen whether public-domain datasets can indeed compete with datasets containing more contemporary content that is protected by copyright.