Today, Open AI announced that
GPTBot, the web crawler used to collect training data for its GPT series of large language models, can now be blocked via the
robots.txt protocol. Site administrators can either disallow crawling of entire sites or create custom rules that allow `GPTBot` access to some parts of a site while blocking it from others. This functionality gives site owners a level of control over how their content is used by OpenAI’s LLMs that they previously lacked.
At first glance, OpenAI’s approach follows the opt-out mechanism established by the TDM exceptions in the EU copyright framework. But on closer inspection, the model/vendor-specific nature of this approach raises more questions than it answers, as it implies that it is the responsibility of website publishers to set rules for each individual ML training crawler operating on the web, rather than setting default permissions that apply to all ML training crawlers.