Last week, the Data Provenance Initiative at MIT released a new paper by Shayne Longpre et al. that documents a dramatic increase in restrictions on the use of publicly available content as AI training data. The first large-scale longitudinal study of the restrictions placed on online content via robots.txt and terms of service, it shows that restrictions on content included in a number of commonly used AI training datasets have risen sharply over the past year. Where there were virtually no restrictions on use as training data just a year ago, the researchers found that more than 28% of the most actively maintained, critical sources for the C4 training dataset are now completely restricted from use via robots.txt.
The paper documents a sharp increase in such restrictions starting in the fall of 2023, which coincides with the point at which OpenAI, Google, and others began documenting how website owners can use robots.txt to block their crawlers from ingesting publicly available content. The paper shows that the pushback from content creators and website owners who object to their work being used for AI training is not only real but is becoming a major issue for AI companies that rely on publicly available online content as training data. The authors suggest that this will be particularly problematic for smaller companies and research projects that lack the resources to license such content.
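For readers who have not looked at one of these files: the sketch below shows what such a block looks like in practice and how it can be checked with Python's standard urllib.robotparser. The GPTBot and Google-Extended user agents are the crawler names documented by OpenAI and Google; the example.com URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt of the kind the study found on a growing share of domains:
# AI training crawlers are blocked entirely, while everything else stays open.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# OpenAI's training crawler is locked out ...
print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
# ... while an ordinary crawler is still allowed in.
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```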
In Europe, where research uses are allowed under a mandatory exception to copyright that cannot be overridden by contract or technological measures, the negative impact on researchers is likely to be more limited than the authors fear. However, there are many other beneficial uses, such as search or web archiving, that will be affected by blanket restrictions via robots.txt and other means. In this context, the authors point to the need for better protocols, which is very much in line with our arguments for standardized rights holder opt-outs. From the concluding section of the paper:
The web needs better protocols to communicate intentions and consent. The [Robots Exclusion Protocol] places an immense burden on website owners to correctly anticipate all agents who may crawl their domain for undesired downstream use cases. We consistently find this leads to protocol implementations that don’t reflect intended consent. An alternative scheme might give website owners control over how their webpages are used rather than who can use them. This would involve standardizing a taxonomy that better represents downstream use cases, e.g. allowing domain owners to specify that web crawling only be used for search engines, or only for non-commercial AI, or only for AI that attributes outputs to their source data. New commands could also set extended restriction periods given dynamic sites may want to block crawlers for extended periods of time, e.g. for journalists to protect their data freshness. Ultimately, a new protocol should lead to website owners having greater capacity to self-sort consensual from non-consensual uses, implementing machine-readable instructions that approximate the natural language instructions in their Terms of Service.
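To make this concrete, here is a small, purely hypothetical sketch of what such a use-based vocabulary could look like. The file format, the Allow-Use and Disallow-Use directives, and the category labels are invented for this post (no such standard exists yet); the point is simply that consent could be expressed per downstream use rather than per crawler.

```python
# Hypothetical "usage policy" file, loosely modelled on robots.txt syntax.
# The directive names and use-case labels are invented here to illustrate
# the kind of taxonomy the authors call for.
POLICY = """\
Allow-Use: search-indexing
Allow-Use: non-commercial-ai
Disallow-Use: commercial-ai-training
Disallow-Use: generative-ai-without-attribution
"""

def parse_policy(text: str) -> dict[str, bool]:
    """Map each declared use case to True (allowed) or False (disallowed)."""
    rules: dict[str, bool] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        directive, _, use_case = line.partition(":")
        use_case = use_case.strip().lower()
        if directive.lower() == "allow-use":
            rules[use_case] = True
        elif directive.lower() == "disallow-use":
            rules[use_case] = False
    return rules

def may_use(rules: dict[str, bool], use_case: str) -> bool:
    """Undeclared use cases default to disallowed in this sketch."""
    return rules.get(use_case.lower(), False)

rules = parse_policy(POLICY)
print(may_use(rules, "search-indexing"))         # True
print(may_use(rules, "commercial-ai-training"))  # False
print(may_use(rules, "web-archiving"))           # False (not declared)
```

Whether undeclared uses should default to allowed or disallowed is exactly the kind of question that a standardization effort along these lines would have to settle.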
Both the New York Times and 404 Media have published articles that go into more detail on the paper.