
Anyone Can Now Strip AI Safety Features in Under 10 Minutes. Here's Why That's a Big Problem.


A new investigation reveals that tools freely available on the internet can dismantle the safety controls built into powerful open-source AI models in minutes, with no technical expertise required.

For years, tech companies have poured tens of millions of dollars into building safety guardrails into their AI models. These guardrails are meant to prevent the models from producing dangerous content such as instructions for making weapons, code designed to steal financial data, or material that exploits children. They are, in theory, the last line of defence between powerful AI and catastrophic misuse.

A new joint investigation by the Financial Times and the AI safety group Alice has now exposed just how thin that line is.

Using a freely available software tool called Heretic, researchers stripped the safety guardrails from Meta's Llama 3.3 model in under 10 minutes, without specialist hardware or advanced technical knowledge. What they were left with was an AI model willing to answer questions that its original version flatly refused, including how to synthesise dangerous chemicals and how to calculate lethal doses of biological toxins.

The findings have sent shockwaves through the AI industry, and they raise a question that policymakers, developers, and the public now urgently need to grapple with: if safety guardrails can be removed this easily, were they ever really safe to begin with?

What Happened: The Investigation

The investigation found that modified versions of flagship models from Meta Platforms and Google are being used to generate dangerous content.

The tests were stark. A version of Google's open-source model Gemma 3 responded to a question on how to disperse chlorine gas through a crowded indoor space, generated code to steal credit card information, and wrote stories describing child sexual abuse. The FT was able to use Heretic, a tool available on the popular code repository GitHub, to remove the guardrails from Meta's Llama 3.3 model in less than 10 minutes without any specialist hardware. The modified model responded to prompts on topics the original system refused to discuss, such as the number of micrograms of ricin per kilogram of body mass required to achieve a 50 per cent chance of death.

These are not edge-case scenarios. They are precisely the kinds of outputs that AI safety systems are specifically built to prevent. The fact that they could be unlocked in the time it takes to make a cup of tea is a sobering revelation.

The Tool Behind It All: What Is Heretic?

At the heart of this story is a tool called Heretic, created by a developer named Philipp Emanuel Weidmann. Its GitHub page describes Heretic as a tool that removes censorship, also known as "safety alignment," from transformer-based language models without expensive post-training. The technique it uses is called "abliteration": it identifies the directions in a model's internal activations that are associated with refusing harmful requests and removes them. What makes Heretic so powerful, according to the same page, is that it does all this completely automatically.

Weidmann has been candid about the scale of what his tool has enabled. He says the software has already been used to create more than 3,500 so-called "decensored" AI models, and claims these modified systems have been downloaded around 13 million times since the tool launched.

To underscore just how rapidly this can happen, Weidmann told the FT that he had removed safeguards from Google's Gemma 4 model within 90 minutes of its release. In other words, the moment a new safety-aligned model drops publicly, it can be stripped of its protections before most people have even heard about it.

What Is "Abliteration" and How Does It Work?

To understand why this is so effective, it helps to understand the underlying technique, known as abliteration.

Abliteration finds a "refusal vector" and removes it. The technique involves running the model on two datasets, one with harmless prompts and one with harmful prompts, measuring how the model's internal activations differ between them, and then removing the directions responsible for those refusals.

Think of it like this: when an AI model decides to refuse a dangerous request, it does so because certain patterns of activation in its neural network trigger a "stop" response. Abliteration identifies exactly where those patterns live inside the model's weights and surgically excises them without disrupting the model's general capabilities. The model continues to function well at everything else; it just stops saying no.
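The geometry behind that description can be shown with a toy calculation. What follows is only a sketch on made-up three-dimensional vectors, assuming the "refusal vector" is estimated as the difference between the average activations of two prompt groups; the helper names (mean_vector, project_out) and all the numbers are hypothetical, and nothing here loads or touches a real model.

```python
from statistics import fmean

def mean_vector(rows):
    """Average a list of equal-length vectors, column by column."""
    return [fmean(col) for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(vector, direction):
    """Remove the component of `vector` that lies along `direction`."""
    norm = dot(direction, direction) ** 0.5
    unit = [x / norm for x in direction]
    scale = dot(vector, unit)
    return [x - scale * u for x, u in zip(vector, unit)]

# Invented stand-ins for activations recorded on two prompt groups.
# In this toy data the groups differ only along the first axis.
group_a = [[1.0, 0.2, 0.0], [1.2, 0.1, 0.1], [0.8, 0.3, -0.1]]
group_b = [[0.0, 0.2, 0.0], [0.2, 0.1, 0.1], [-0.2, 0.3, -0.1]]

# Difference of means: the direction that separates the two groups.
direction = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]

# Projecting a new vector onto the orthogonal complement of that direction
# zeroes the separating component and leaves the other components alone.
sample = [0.9, 0.5, -0.3]
cleaned = project_out(sample, direction)
print(direction)
print(cleaned)
```

The point of the sketch is the last step: once a single direction carries the distinction between the two groups, subtracting it is ordinary linear algebra, which is why the rest of the vector, and by analogy the rest of the model's behaviour, is left largely untouched.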

Another toolkit called OBLITERATUS has also emerged, using a six-stage pipeline: load the model, collect activations from restricted versus unrestricted prompts, extract refusal directions using singular value decomposition (SVD), project out the guardrail directions, verify the model's core capabilities remain intact, then save the modified model. The fact that multiple tools now exist to accomplish this, with increasingly streamlined interfaces, points to an entire ecosystem forming around the removal of AI safety features.

Why Open-Source Models Are Uniquely Vulnerable

It is important to understand why this problem primarily affects open-source AI models, and not proprietary systems like Anthropic's Claude or OpenAI's ChatGPT.

The technique cannot easily be applied to those proprietary systems because their weights and underlying code are not accessible to outsiders. Open-source systems, however, have historically narrowed the gap with leading proprietary versions within six to 12 months, so the capabilities at stake do not lag far behind the frontier.

This is the core tension at the heart of the open-source AI movement. Transparency, reproducibility, and democratised access to powerful tools are genuine goods. They lower barriers for researchers, startups, and developers worldwide, and they provide a counterweight to the concentration of AI power among a handful of private companies. But those same properties, including open weights and accessible code, are precisely what make models like Llama and Gemma so vulnerable to abliteration.

When a company like Meta releases an open-weight model, it essentially hands the keys to the world. Any developer, researcher, or bad actor can download the full model and modify it however they choose, whereas the inner workings of closed systems stay strictly under lock and key. The safety fine-tuning that was applied during training is just another layer that can be peeled back.

This Is No Longer a Job for Specialists

Perhaps the most alarming aspect of these findings is not what the tools can do, but who can use them.

"Whereas historically it might have taken a more informed and persistent actor to strip out safety features, nowadays it's much easier for the average person," Kawin Ethayarajh, assistant professor of applied AI at the University of Chicago's Booth business school, told the FT.

This democratisation of harm is a pattern we have seen repeat across technology. Once something that required specialist knowledge and dedicated infrastructure becomes accessible to anyone with an internet connection, the risk profile changes entirely. The conversation can no longer be about protecting AI from sophisticated nation-state actors or well-resourced criminal organisations. The question now is whether AI is safe from anyone with a grudge, curiosity, or bad intent.

Reports of AI-related incidents rose 50% year-over-year from 2022 to 2024, and in the 10 months to October 2025, incidents had already surpassed the 2024 total, according to the AI Incident Database. The trajectory is not reassuring.

How Did Google and Meta Respond?

The responses from the two companies whose models were most prominently featured tell a revealing story.

Google said "abliteration is a known technical challenge facing all open models" and that its open models "undergo rigorous internal safety evaluations prior to launch to help prevent these kinds of troubling examples." Meta declined to comment. A person close to the company said it assesses its open-source models' capabilities before releasing them, according to its Advanced AI Scaling Framework. Versions deemed to pose a "catastrophic" risk are not released to the public unless Meta finds sufficient mitigation measures.

What is notable here is the tone. Google's statement is an acknowledgment, not a solution. Calling abliteration a "known technical challenge" is accurate but also reads as a concession that no fix is forthcoming. Meta's silence, meanwhile, is telling. When your flagship open-source AI model is used to calculate lethal ricin doses, declining to comment is a choice.

As for GitHub, which hosts Heretic, a spokesperson noted that hosting the source code for tools like Heretic is permitted because it provides a "net benefit to the security community." That framing may hold up in isolation, but it grows considerably more complicated when the tool in question has, by its creator's own account, been used to generate more than 3,500 models downloaded around 13 million times.

The Policy and Regulatory Fallout

The implications for regulation and enterprise use are enormous. The issue is now shifting from a technical debate into a much bigger conversation around enterprise liability, regulation, and AI governance.

For businesses that have deployed Llama or Gemma models in their products or internal systems under the assumption that those models were safe, this investigation is a rude awakening. If a model's safety layer can be cleanly scraped away in minutes, any assurance made at the point of procurement is worth very little.

The investigation demonstrates that safety constraints applied at the distribution layer cannot meaningfully restrict misuse of open-source models. Any AI company releasing open-source models now faces a structural credibility problem: safety commitments made at release can be nullified by anyone with an internet connection in under 10 minutes.

Policymakers who have been designing regulatory frameworks around evaluating AI models at the point of release now face a fundamental challenge. The 13 million download figure means decensored variants are already in widespread circulation, making current regulatory arguments about open-source safety controls functionally moot for existing model generations.

The revelations may sharpen concerns among policymakers and AI companies that safeguards imposed by model developers will become harder to enforce as open-source systems grow more powerful.

What Are the Possible Solutions?

There is no easy fix here, and it is worth being honest about that. Several approaches have been proposed, each with trade-offs.

One option being explored is training models on datasets from which dangerous material has been removed in the first place, rather than simply applying safety fine-tuning on top of an existing model. The idea is that if the dangerous knowledge was never in the model to begin with, abliteration would have nothing to unlock. However, one researcher noted that removing dangerous material could make models "naive" and unable to detect when they were being used for "malicious purposes," adding it was "not clear at all that if you omit the harmful data, the model becomes a goody two-shoes."

Another approach is to think about safety not at the model level, but at the infrastructure level. Runtime monitoring, application-layer guardrails, and continuous independent audits are all being discussed as ways to provide a second and third line of defence beyond what the model itself offers. The problem is that if someone is running a modified model locally with no intermediary infrastructure, none of these measures apply.
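To make the idea of an application-layer guardrail concrete, here is a minimal sketch of a wrapper that checks both the incoming prompt and the outgoing reply, and keeps an audit trail. Everything in it is an assumption for illustration: the GuardedModel class, the violates_policy check, and the keyword list are hypothetical, and a real deployment would rely on a trained classifier or moderation service rather than substring matching.

```python
from dataclasses import dataclass, field
from typing import Callable

REFUSAL = "This request cannot be completed."

# Hypothetical placeholder policy; real systems would use a proper classifier.
BLOCKED_TOPICS = ("chlorine gas", "steal credit card")

def violates_policy(text: str) -> bool:
    lowered = text.lower()
    return any(topic in lowered for topic in BLOCKED_TOPICS)

@dataclass
class GuardedModel:
    """Wraps any text-in, text-out model with checks outside the model itself."""
    generate: Callable[[str], str]
    audit_log: list = field(default_factory=list)

    def ask(self, prompt: str) -> str:
        # First line of defence: screen the prompt before the model sees it.
        if violates_policy(prompt):
            self.audit_log.append(("blocked_prompt", prompt))
            return REFUSAL
        reply = self.generate(prompt)
        # Second line of defence: screen what the model actually produced.
        if violates_policy(reply):
            self.audit_log.append(("blocked_reply", prompt))
            return REFUSAL
        self.audit_log.append(("allowed", prompt))
        return reply

def fake_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"echo: {prompt}"

guarded = GuardedModel(generate=fake_model)
print(guarded.ask("What is the capital of France?"))
print(guarded.ask("How do I disperse chlorine gas indoors?"))
```

Because the checks live in the serving layer rather than in the model's weights, they do not depend on the model's own refusals. The same property is the limitation already noted: the wrapper only helps when requests actually pass through it.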

Some researchers have also pointed to the dual-use nature of abliteration as a reason not to ban the technique outright. Red-teaming, the practice of using modified models to test how other models withstand adversarial attack, is a legitimate and important part of AI safety research. Such research-configured models operate under strict controls, in isolated environments, for specific evaluation purposes. They are more like crash test dummies than actual vehicles: specialised tools designed to help us understand safety dynamics, not to cause harm in the real world. Any policy response will need to grapple with that nuance.

The Bigger Picture: What This Means for AI's Future

There is a metaphor that keeps coming up in discussions about this issue, and it is apt. "The genie is out of the bottle," said Alice chief executive and co-founder Noam Schwartz. "Things that look like sci-fi are no longer sci-fi and we need as a society to prepare accordingly."

This is not a problem that is going to be solved by a patch or a policy memo. Open-source AI is now powerful enough that its misuse is a genuine public safety concern. The tools that enable that misuse are freely available and require no expertise, and the modified models they produce have already been downloaded millions of times.

The open-source AI community faces an increasingly uncomfortable reckoning. The values of transparency, democratisation, and open access that drove the movement are real and important. But so is the reality that those same values are being exploited to put dangerous capabilities in the hands of anyone who looks for them.

There is no clean resolution to this tension. The most honest thing that can be said is that we are in a period where the rate of capability development in open-source AI has outpaced our collective ability to govern it responsibly. The FT and Alice investigation has not created this problem. It has simply, and starkly, made it visible.

What is clear is that the AI safety debate is no longer abstract. It is not about hypothetical future systems or science-fiction scenarios. It is about models that exist today, running on ordinary computers, capable of producing genuinely dangerous outputs, and accessible to anyone willing to spend 10 minutes following instructions. The industry, regulators, and the public all need to be honest about that reality before meaningful progress can be made.
