Common Crawl Criticized for 'Quietly Funneling Paywalled Articles to AI Developers' - Slashdot

For more than a decade, the nonprofit Common Crawl “has been scraping billions of webpages to build a massive archive of the internet,” notes The Atlantic, and it makes that archive freely available for research.
“In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models.

“In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this — as well as masking the actual contents of its archives…”

Common Crawl’s website states that it scrapes the internet for “freely available content” without “going behind any ‘paywalls.’” Yet the organization has taken articles from major news websites that people normally have to pay for — allowing AI companies to train their LLMs on high-quality journalism for free. Meanwhile, Common Crawl’s executive director, Rich Skrenta, has publicly made the case that AI models should be able to access anything on the internet. “The robots are people too,” he told me, and should therefore be allowed to “read the books” for free. Multiple news publishers have requested that Common Crawl remove their articles to prevent exactly this use. Common Crawl says it complies with these requests. But my research shows that it does not.

I’ve discovered that pages downloaded by Common Crawl have appeared in the training data of thousands of AI models. As Stefan Baack, a researcher formerly at Mozilla, has written, “Generative AI in its current form would probably not be possible without Common Crawl.” In 2020, OpenAI used Common Crawl’s archives to train GPT-3. OpenAI claimed that the program could generate “news articles which human evaluators have difficulty distinguishing from articles written by humans,” and in 2022, an iteration on that model, GPT-3.5, became the basis for ChatGPT, kicking off the ongoing generative-AI boom. Many different AI companies are now using publishers’ articles to train models that summarize and paraphrase the news, and are deploying those models in ways that steal readers from writers and publishers.

Common Crawl maintains that it is doing nothing wrong. I spoke with Skrenta twice while reporting this story. During the second conversation, I asked him about the foundation archiving news articles even after publishers have asked it to stop. Skrenta told me that these publishers are making a mistake by excluding themselves from “Search 2.0” — referring to the generative-AI products now widely being used to find information online — and said that, anyway, it is the publishers that made their work available in the first place. “You shouldn’t have put your content on the internet if you didn’t want it to be on the internet,” he said. Common Crawl doesn’t log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you’re a subscriber and hides the content if you’re not. Common Crawl’s scraper never executes that code, so it gets the full articles.
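The client-side paywall mechanism described above can be illustrated with a small sketch. It is not Common Crawl's code; the page markup, the `isSubscriber()` check, and the `extract_article` helper are all hypothetical. The point is only that a fetcher which parses raw HTML and never runs JavaScript sees whatever text the server already sent.

```python
from html.parser import HTMLParser

# Hypothetical page: the full article text is already in the HTML, and a
# script (run only by a JavaScript-capable browser) removes it afterward.
PAGE = """
<html><body>
<article id="story"><p>Full text of a subscriber-only article.</p></article>
<script>
  // Hypothetical paywall check; a plain HTML parser never executes this.
  if (!isSubscriber()) { document.getElementById("story").remove(); }
</script>
</body></html>
"""


class ArticleText(HTMLParser):
    """Collects the text found inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.in_article = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False

    def handle_data(self, data):
        # Script contents also arrive here, but outside <article>, so they are skipped.
        if self.in_article and data.strip():
            self.chunks.append(data.strip())


def extract_article(html: str) -> str:
    """Return the article text as a non-JavaScript fetcher would see it."""
    parser = ArticleText()
    parser.feed(html)
    return " ".join(parser.chunks)


print(extract_article(PAGE))
```

A paywall enforced on the server, where the full text is never sent to non-subscribers, would not be exposed this way.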

Thus, by my estimate, the foundation’s archives contain millions of articles from news organizations around the world, including The Economist, the Los Angeles Times, The Wall Street Journal, The New York Times, The New Yorker, Harper’s, and The Atlantic…. A search for nytimes.com in any crawl from 2013 through 2022 shows a “no captures” result, when in fact there are articles from NYTimes.com in most of these crawls.
“In the past year, Common Crawl’s CCBot has become the scraper most widely blocked by the top 1,000 websites,” the article points out…
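The article does not say how those sites block CCBot. One common way to opt out of a crawler is a robots.txt rule, so the sketch below assumes that mechanism; the rules and the `allowed` helper are illustrative, and robots.txt only works if the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for a hypothetical publisher: refuse CCBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Check whether the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("CCBot", "https://example.com/news/story"))
print(allowed("SomeOtherBot", "https://example.com/news/story"))
```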
