This article is part of our coverage of the latest in AI research.
Developers of frontier AI systems are constantly taking measures to harden their models against jailbreaking attacks. But Best-of-N (BoN) jailbreaking, a new technique developed by researchers at Speechmatics, MATS, and Anthropic, shows how difficult it is to close the safety gaps in large language models (LLMs).
BoN is a simple black-box algorithm that achieves high attack success rates (ASR) on private LLMs with a reasonable number of prompts. It is also effective against vision language models (VLMs) and audio language models (ALMs), serving as a reminder of how sensitive these models are to even simple perturbations of their inputs.
How BoN jailbreaking works
Say an attacker wants to send a harmful request to an LLM. If the request is sent in plain form, the model's safeguards will probably detect and block it. BoN jailbreaking applies repeated augmentations to the harmful request until one of them slips through the model's defenses or a maximum number of attempts is exhausted. The augmentations are chosen so that the original request and its intent remain recognizable.
BoN is a black-box technique, which means it does not need access to the model's weights, making it applicable to private models such as GPT-4, Claude, and Gemini. It is also easy and fast to implement, and it works across modalities, making it applicable to the latest models that support vision and audio inputs.
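The core loop is easy to sketch. The snippet below is a minimal illustration in Python, assuming a generic `query_model` call to a black-box chat API and a `judge` function that flags harmful completions; the text augmentations (shuffling characters within words, random capitalization, and small ASCII perturbations) follow the kinds of changes described for BoN, but the parameter values here are illustrative, not the researchers' settings.

```python
import random
import string

def augment(request: str, p: float = 0.6) -> str:
    """Apply lightweight text augmentations that keep the request readable:
    shuffle the middle characters of some words, randomly flip letter case,
    and perturb a few characters (illustrative parameters)."""
    words = []
    for word in request.split():
        chars = list(word)
        # Shuffle the middle of longer words, keeping first/last characters.
        if len(chars) > 3 and random.random() < p:
            middle = chars[1:-1]
            random.shuffle(middle)
            chars = [chars[0]] + middle + [chars[-1]]
        out = []
        for c in chars:
            # Randomly flip case, and occasionally nudge a character's ASCII code.
            if random.random() < p:
                c = c.upper() if random.random() < 0.5 else c.lower()
            if c in string.ascii_letters and random.random() < 0.05:
                c = chr(max(33, min(126, ord(c) + random.choice((-1, 1)))))
            out.append(c)
        words.append("".join(out))
    return " ".join(words)

def bon_jailbreak(request: str, query_model, judge, max_samples: int = 10_000):
    """Best-of-N loop: keep sampling augmented prompts until the judge flags
    a harmful completion or the sample budget is exhausted."""
    for n in range(1, max_samples + 1):
        prompt = augment(request)
        response = query_model(prompt)   # black-box API call
        if judge(request, response):     # e.g., a HarmBench-style classifier
            return n, prompt, response
    return None
```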
Best-of-N jailbreaking in action
The researchers tested BoN jailbreaking on leading closed models such as Claude 3.5, GPT-4o, and Gemini 1.5, as well as the open-weight Llama 3.
They ran their tests on HarmBench, a dataset used to red-team LLMs with harmful requests. They considered a jailbreak successful if it forced the model to provide information relevant to the harmful request, even if the response was not complete or comprehensive.
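Evaluation then reduces to counting how many HarmBench behaviors are broken within a given sample budget. The snippet below is a minimal sketch of that bookkeeping, assuming a `run_attack` function like the loop sketched earlier that returns `None` when no sample within the budget succeeds; the judge that decides whether a response counts as harmful is not shown.

```python
def attack_success_rate(behaviors, run_attack, budget=10_000):
    """Fraction of HarmBench behaviors for which at least one augmented
    prompt within the budget elicited relevant harmful information.
    `run_attack` is assumed to return None when no sample succeeds."""
    broken = sum(run_attack(behavior, max_samples=budget) is not None
                 for behavior in behaviors)
    return broken / len(behaviors)
```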
Their experiments show that BoN jailbreaking is an effective attack on frontier LLMs. Without augmentation, attack success rates on all models were under 1%. With 10,000 augmented samples, the researchers achieved 78% ASR on Claude 3.5 Sonnet and 50% on Gemini Pro, which is impressive given the number of safeguards built into these models. Most successful attacks, however, needed far fewer samples. For example, 53% to 71% of the jailbreaks on Claude and GPT models required sampling only 100 augmented attacks. On Gemini, which proved to be the most resilient model family, 100 samples were enough for 22% to 30% of the attacks. To put that into perspective, running an attack with 100 samples costs around $9 on GPT-4o and $13 on Claude.
Another interesting finding is that combining BoN with other jailbreaking techniques makes the ASR curve climb more steeply. Moreover, BoN remained effective against popular defense mechanisms, including circuit breakers and Gray Swan Cygnet.
BoN also works on non-text modalities. In VLMs, for example, BoN renders the text of the harmful instruction into an image, fills the background with random colored blocks, and prompts the model to follow the instructions in the image. Audio attacks deliver the request as AI-generated speech to trick the ALM into producing harmful content. BoN image and audio attacks are less successful than text attacks but are impressive nonetheless.
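For the vision modality, the augmented input is an image rather than a string. The sketch below, using Pillow, renders the request as text over a field of random colored rectangles so the instruction stays legible while the pixels vary between samples; the specific fonts, sizes, and layouts the researchers varied are not reproduced here, and all parameters are illustrative.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_image_attack(request: str, width: int = 768, height: int = 256) -> Image.Image:
    """Render the request on a background of random colored blocks."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    # Fill the background with randomly placed, randomly colored rectangles.
    for _ in range(40):
        x0, y0 = random.randint(0, width - 1), random.randint(0, height - 1)
        x1, y1 = x0 + random.randint(20, 120), y0 + random.randint(20, 80)
        color = tuple(random.randint(0, 255) for _ in range(3))
        draw.rectangle([x0, y0, x1, y1], fill=color)
    # Overlay the instruction text at a random position.
    font = ImageFont.load_default()
    x = random.randint(10, width // 4)
    y = random.randint(10, height // 2)
    draw.text((x, y), request, fill="black", font=font)
    return img
```

Each call produces a different image of the same instruction, which is then sent to the VLM with a prompt to follow the instructions shown in the image.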
Why BoN works
While increasing the model's temperature results in minor improvements, it is the variance introduced by the augmentations that accounts for most of BoN's success.
“This is empirical evidence that augmentations play a crucial role in the effectiveness of BoN, beyond mere resampling,” the researchers write. “We hypothesize that this is because they substantially increase the entropy of the effective output distribution, which improves the algorithm’s performance.”
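A back-of-the-envelope model helps make the intuition concrete. If each sampled prompt succeeded independently with some small probability p, the chance that at least one of N samples succeeds would be 1 - (1 - p)^N. The snippet below is my own simplified illustration, not the researchers' analysis: it shows how even a modest bump in per-sample success probability, of the kind that diverse augmentations can produce relative to plain resampling, compounds sharply over thousands of samples.

```python
def best_of_n_asr(p: float, n: int) -> float:
    """Probability that at least one of n independent samples succeeds,
    assuming each succeeds with probability p (a simplifying assumption)."""
    return 1 - (1 - p) ** n

# Compare a low per-sample success rate (little output diversity) with
# slightly higher ones: the gap widens dramatically as n grows.
for p in (0.0001, 0.0005, 0.002):
    print(f"p={p:.4f}:", [round(best_of_n_asr(p, n), 3) for n in (100, 1_000, 10_000)])
```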
The researchers suggest several ways the algorithm could be improved, such as rephrasing requests, adding ciphers, or using the SVG format for image attacks.
“Overall, BoN Jailbreaking is a simple, effective, and scalable jailbreaking algorithm that successfully jailbreaks all of the frontier LLMs we considered,” the researchers write. “We thus see that despite the sophistication and advanced capabilities of frontier AI systems, their properties—stochastic outputs and sensitivity to variations in their high-dimensional input spaces—can be exploited by even simple attack algorithms.”