By Kevin Roose
For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.
Now, that data is drying up.
Over the past year, many of the most important web sources used for training AI models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an MIT-led research group.
The study, which looked at 14,000 web domains that are included in three commonly used AI training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.
The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 per cent of all data, and 25 per cent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
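For readers curious about what such a restriction looks like in practice, here is a minimal sketch using Python's standard-library robots.txt parser. The file contents, the crawler name "ExampleAIBot" and the example.com address are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one made-up AI crawler from the whole site
# and leaves every other bot unrestricted.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print("ExampleAIBot allowed:", can_crawl(ROBOTS_TXT, "ExampleAIBot", page))
    print("SomeOtherBot allowed:", can_crawl(ROBOTS_TXT, "SomeOtherBot", page))
```

The protocol is advisory: the file states the site owner's wishes, and it is up to each crawler to check it and comply.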
The study also found that as much as 45 per cent of the data in one set, C4, had been restricted by websites’ terms of service. “We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for AI companies, but for researchers, academics and noncommercial entities,” said Shayne Longpre, the study’s lead author, in an interview.
Data is the main ingredient in today’s generative AI systems, which are fed billions of examples of text, images and videos. Much of that data is scraped from public websites by researchers and compiled in large data sets, which can be downloaded and freely used, or supplemented with data from other sources. Learning from that data is what allows generative AI tools like OpenAI’s ChatGPT, Google’s Gemini and Anthropic’s Claude to write, code and generate images and videos. The more high-quality data is fed into these models, the better their outputs generally are.
For years, AI developers were able to gather data fairly easily. But the generative AI boom of the past few years has led to tensions with the owners of that data — many of whom have misgivings about being used as AI training fodder, or at least want to be paid for it. As the backlash has grown, some publishers have set up paywalls or changed their terms of service to limit the use of their data for AI training. Others have blocked the automated web crawlers used by companies like OpenAI, Anthropic and Google.
Sites like Reddit and StackOverflow have begun charging AI companies for access to data, and a few publishers have taken legal action, including The New York Times, which sued OpenAI and Microsoft for copyright infringement last year, alleging that the companies used news articles to train their models without permission.
Companies like OpenAI, Google, and Meta have gone to extreme lengths in recent years to gather more data to improve their systems. More recently, some AI companies have struck deals with publishers including The Associated Press and News Corp, the owner of The Wall Street Journal, giving them ongoing access to their content.
©2024 The New York Times News Service