Two million Dutch news items removed from AI database

At issue is Common Crawl, a scraper run by an American non-profit organization that makes copies of countless websites. These copies are freely available for anyone to use, including for training AI models.

Common Crawl currently contains 2.6 billion web pages. Nearly all major AI models use the collection, including ChatGPT, Claude, and DeepSeek.

The scraped websites also include tens of thousands of Dutch pages, ranging from small websites to large news platforms. Brein, the Dutch anti-piracy foundation, determined that the database contained articles from Dutch news websites and digital newspapers, among others, that had been copied without permission.

News websites are a vital source of information for language models and AI chatbots. That same use threatens the sites themselves: AI chatbots can reduce visitor numbers, which lowers the sites' revenue.

NDP Nieuwsmedia, the trade association for news companies, argues that AI companies are "parasitizing on the work of journalists" by using these types of scrapers.

"It's very damaging for authors and publishers that their texts are used without permission," Bastiaan van Ramshorst, director of Brein, told RTL Z. "That's why, on behalf of several publishers, we've asked for those articles to be taken offline."

According to Van Ramshorst, Common Crawl responded quickly to the request, but it will take some time before all the articles are offline. "That's because it's such a large database. That also made it difficult to determine exactly which articles are in it."

The fact that the articles are no longer in this database doesn't mean they won't appear in AI models at all. Existing models have already processed the articles, and the articles won't disappear from them. Moreover, AI companies are also building their own scrapers, and it's unclear whether the data those scrapers collect includes copyrighted material.

"If such a model isn't transparent, it's very difficult to determine the underlying data," says Van Ramshorst. "We do investigate that, but it's quite time-consuming."

A small silver lining: next year a new European law, the AI Act, will come into effect, requiring AI companies to be more transparent about their sources.

Besides news reports and other text, music is also used to train AI. This video shows how The Velvet Sundown racks up millions of streams, even though the band doesn't even exist:

RTL Nieuws
