The New York Times (NYT) has succeeded in getting its content removed from one of the largest AI training datasets. The newspaper found that its paywalled and other copyrighted content was being used without approval or compensation in major AI training datasets, including Common Crawl and WebText. It asked the Common Crawl Foundation to pull its content from the dataset after updating its Terms of Service to prohibit the future use of its content to train AI-powered large language models (LLMs).
Charlie Stadlander, a spokesman for the NYT, confirmed to Business Insider: 'We simply asked that our content be removed and were pleased that Common Crawl complied with our request and recognized The Times's ownership of our quality journalistic content.'
The move is significant because Common Crawl is one of the largest AI training datasets, built by scraping the web with CCBot, its web crawler. It supplies training data for many LLMs, including OpenAI's GPT models. The NYT is not alone in claiming that ChatGPT maker OpenAI has illegally scraped training data: several lawsuits have been filed against OpenAI and its major partner and investor, Microsoft. These 'data revolts' against LLM makers are being led by news organizations, social media firms, publishers, and authors.
The NYT and other content creators are pushing back against the use of their copyrighted work in AI training datasets. According to Originality.ai, as of late September almost 14% of the 1,000 most popular websites were blocking CCBot, including Amazon, CNN, Reuters, The New Yorker, The Atlantic, and Vimeo.
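In practice, blocking CCBot is typically done with a robots.txt rule (`User-agent: CCBot` followed by `Disallow: /`), which Common Crawl says its crawler honors. As a minimal sketch of how such blocking can be verified, the Python snippet below checks whether a given site's robots.txt disallows CCBot; the target URL is purely illustrative.

```python
import urllib.robotparser

# Minimal sketch: check whether a site's robots.txt blocks Common Crawl's
# crawler. "CCBot" is Common Crawl's documented user-agent token; the
# target site here is only an illustrative example.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.nytimes.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

if rp.can_fetch("CCBot", "https://www.nytimes.com/"):
    print("CCBot is allowed")
else:
    print("CCBot is blocked")
```

Running a check like this across a list of popular domains is one plausible way a tracker such as Originality.ai could arrive at its blocking figures, though the company has not detailed its exact methodology.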
The Times and other news publishers are also exploring ways to charge AI companies for their data. It remains to be seen whether a domino effect will reach tech giants like Meta, Google, and Microsoft, or whether their massive stores of proprietary data and deep resources will shield them.