Tools that strip AI safety controls from open models are spreading: FT

Open-source artificial intelligence models are facing fresh scrutiny after tests showed that safety controls built into some systems could be stripped away quickly, allowing altered versions to answer harmful prompts.

MIT SMR Editors

Software tools are being used to remove guardrails from AI models released by companies including Meta and Google, creating modified systems that can respond to prompts involving biological weapons, malware and child exploitation, the Financial Times reported on Tuesday.

The tests were conducted by the FT and AI safety group Alice.

The findings add to concerns that safeguards imposed by AI developers may be difficult to enforce once open models are downloaded, copied and modified by outside users.

Unlike closed systems such as ChatGPT or Anthropic’s Claude, open models can be accessed and adapted by developers, making post-release control harder.

The FT said one tool available on GitHub was used to remove protections from Meta’s Llama 3.3 model in less than 10 minutes, using freely available software and a small amount of code. The modified model then responded to prompts that the original system had refused.

Researchers told the newspaper that the risk has grown as open models become more capable and easier to adapt.

The report identified a technique known as “abliteration,” which can weaken or remove refusal behavior in open models. The method has become part of a wider ecosystem of “decensored” models hosted and shared online.

Alice chief executive and co-founder Noam Schwartz told the FT that society needed to prepare for risks that had previously seemed remote. “The genie is out of the bottle,” he said.

The issue poses a difficult policy problem. Governments and AI companies have largely focused on safety testing and guardrails at the point of model development. But once open models are released, modified versions can circulate beyond the control of their original creators.

One proposed response is to remove dangerous material from training data before models are built. But that approach could have trade-offs, including making models less able to recognize malicious use.

Google told the FT that abliteration was a known challenge for all open models and said its systems undergo internal safety evaluations before release.

GitHub reportedly said it prohibits content that directly supports unlawful attacks or malware campaigns, while allowing some security-related code because it can have educational value.
