For artificial intelligence researchers, the launch of OpenAI’s ChatGPT on November 30, 2022, changed the world in a way similar to the detonation of the first atomic bomb.
The Trinity test, in New Mexico on July 16, 1945, marked the beginning of the atomic age. One manifestation of that moment was the contamination of metals manufactured after that date – as airborne particulates left over from Trinity and other nuclear weapons permeated the environment.
Everyone participating in generative AI is polluting the data supply for everyone
The poisoned metals interfered with the function of sensitive medical and technical equipment. So until recently, scientists involved in the production of those devices sought metals uncontaminated by background radiation, referred to as low-background steel, low-background lead, and so on.
One source of low-background steel was the German naval fleet that Admiral Ludwig von Reuter scuttled in 1919 to keep the ships from the British.
More about that later.
Shortly after the debut of ChatGPT, academics and technologists started to wonder if the recent explosion in AI models has also created contamination.
Their concern is that AI models are being trained with synthetic data created by AI models. Subsequent generations of AI models may therefore become less and less reliable, a state known as AI model collapse.
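To make the feedback loop concrete, here is a minimal toy sketch in Python. It is not how any real AI model is trained, and it is not taken from the papers discussed in this article. It assumes a stand-in "model" that simply memorizes its training set and generates output by sampling from it with replacement, with each generation's output becoming the next generation's training data. The function name `simulate_collapse` and its parameters are hypothetical, chosen for illustration.

```python
import random


def simulate_collapse(vocab_size=1000, sample_size=1000, generations=20, seed=0):
    """Return the number of distinct items surviving in each generation.

    Toy assumption: the "model" is nothing more than the empirical
    distribution of its training data.
    """
    rng = random.Random(seed)
    # Generation 0 stands in for human-made data: every item appears once.
    data = list(range(vocab_size))
    distinct_counts = [len(set(data))]
    for _ in range(generations):
        # The model "generates" by sampling its training set with replacement,
        # and that output is all the next generation gets to train on.
        data = rng.choices(data, k=sample_size)
        distinct_counts.append(len(set(data)))
    return distinct_counts


if __name__ == "__main__":
    counts = simulate_collapse()
    print("distinct items, first generation:", counts[0])
    print("distinct items, last generation:", counts[-1])
```

Because each generation can only reuse items that the previous one happened to emit, the count of distinct items can never go up, and rare items are the first to vanish. That loss of diversity is a rough analogue of the degradation researchers worry about, under the strong simplifying assumptions stated above.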
In March 2023, John Graham-Cumming, then CTO of Cloudflare and now a board member, registered the web domain lowbackgroundsteel.ai and began posting about various sources of data compiled prior to the 2022 AI explosion, such as the Arctic Code Vault (a snapshot of GitHub repos from February 2, 2020).
The Register asked Graham-Cumming whether he came up with the low-background steel analogy, but he said he didn’t recall.
“I knew about low-background steel from reading about it years ago,” he responded by email. “And I’d done some machine learning stuff in the early 2000s for [automatic email classification tool] POPFile. It was an analogy that just popped into my head and I liked the idea of a repository of known human-created stuff. Hence the site.”
Is collapse a real crisis?
Graham-Cumming isn’t sure contaminated AI corpuses are a problem.
“The interesting question is ‘Does this matter?’” he asked.
Some AI researchers think it does and that AI model collapse is concerning. The year after ChatGPT’s debut, several academic papers explored the potential consequences of model collapse or Model Autophagy Disorder (MAD), as one set of authors termed the issue. The Register interviewed one of the authors of those papers, Ilia Shumailov, in early 2024.
Though AI practitioners have argued that model collapse can be mitigated, the extent to which that’s true remains a matter of ongoing debate.
Just last week, Apple researchers entered the fray with an analysis of model collapse in large reasoning models (e.g. OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking), only to have their conclusions challenged by Alex Lawsen, senior program associate with Open Philanthropy, with help from AI model Claude Opus.
Essentially, Lawsen argued that Apple’s reasoning evaluation tests, which found reasoning models fail at a certain level of complexity, were flawed because they forced the models to write more tokens than they could accommodate.
In December 2024, academics affiliated with several universities reiterated concerns about model collapse in a paper titled “Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training.”
They contended the world needs sources of clean data, akin to low-background steel, to maintain the function of AI models and to preserve competition.
“I often say that the greatest contribution to nuclear medicine in the world was the German admiral who scuppered the fleet in 1919,” Maurice Chiodo, research associate at the Centre for the Study of Existential Risk at the University of Cambridge and one of the co-authors, told The Register. “Because that enabled us to have this almost infinite supply of low-background steel. If it weren’t for that, we’d be kind of stuck.
“So the analogy works here because you need something that happened before a certain date. Now here the date is more flexible, let’s say 2022. But if you’re collecting data before 2022 you’re fairly confident that it has minimal, if any, contamination from generative AI. Everything before the date is ‘safe, fine, clean,’ everything after that is ‘dirty.’”
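In code, the cutoff idea Chiodo describes amounts to a date filter. The sketch below is illustrative only: the function name `split_by_cutoff` and the `(created, text)` record shape are hypothetical, and the cutoff is assumed to be ChatGPT's launch date, although Chiodo notes the date is flexible. It also assumes creation dates are trustworthy, which real-world metadata often is not.

```python
from datetime import date

# Assumed cutoff: ChatGPT's launch date. The right date is a judgment call.
CUTOFF = date(2022, 11, 30)


def split_by_cutoff(documents, cutoff=CUTOFF):
    """Split (created, text) pairs into pre-cutoff and post-cutoff texts."""
    clean, suspect = [], []
    for created, text in documents:
        # Anything dated on or after the cutoff is treated as possibly
        # AI-generated; anything strictly earlier is treated as "clean".
        if created < cutoff:
            clean.append(text)
        else:
            suspect.append(text)
    return clean, suspect


if __name__ == "__main__":
    docs = [
        (date(2020, 2, 2), "archived repository snapshot"),
        (date(2024, 1, 15), "recent web page"),
    ]
    clean, suspect = split_by_cutoff(docs)
    print("clean:", clean)
    print("suspect:", suspect)
```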
What Chiodo and his co-authors – John Burden, Henning Grosse Ruse-Khan, Lisa Markschies, Dennis Müller, Seán Ó hÉigeartaigh, Rupprecht Podszun, and Herbert Zech – worry about is not so much that models fed on their own output will produce unreliable information, but that access to supplies of clean data will confer a competitive advantage to early market entrants.
With AI model-makers spewing more and more generative AI data on a daily basis, AI startups will find it harder to obtain quality training data, creating a lockout effect that makes their models more susceptible to collapse and reinforces the power of dominant players. That’s their theory, anyway.
“So it’s not just about the sort of epistemic security of information and what we see is true, but it’s what it takes to build a generative AI, a large-range model, so that it produces output that’s comprehensible and that’s somehow usable,” Chiodo said. “You can build a very usable model that lies. You can build quite a useless model that tells the truth.”
Rupprecht Podszun, professor of civil and competition law at Heinrich Heine University Düsseldorf and a co-author, said, “If you look at email data or human communication data – which pre-2022 is really data which was typed in by human beings and sort of reflected their style of communication – that’s much more useful [for AI training] than getting what a chatbot communicated after 2022.”
Podszun said the accuracy of the content matters less than the style and the creativity of the ideas during real human interaction.
Chiodo said everyone participating in