This article is part of our coverage of the latest in AI research.
Developers of frontier AI systems are constantly taking measures to harden their models against jailbreaking attacks. But Best-of-N (BoN) jailbreaking, a new technique developed by researchers at Speechmatics, MATS, and Anthropic, shows how difficult it is to close the safety gaps in large language models (LLMs).
BoN is a simple black-box algorithm that achieves high attack success rates (ASR) on private LLMs with a reasonable number of prompts. It is also effective against vision language models (VLMs) and audio language models (ALMs), serving as a reminder of how sensitive these models are to even simple perturbations of their inputs.
How BoN jailbreaking works
Say an attacker wants to send a harmful request to an LLM. If the request is sent in plain form, the model's safeguards will probably detect and block it. BoN jailbreaking applies repeated augmentations to the harmful request until one of them slips through the model's defenses or a maximum number of attempts is exhausted. The augmentations are chosen so that the original request and its intent remain recognizable.
BoN is a black-box technique, which means it does not need access to the model's weights, making it applicable to private models such as GPT-4, Claude, and Gemini. It is also easy and fast to implement, and it works across modalities, making it applicable to the latest models that support vision and audio inputs.
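The core loop is easy to sketch. The snippet below is a minimal illustration in Python, assuming a generic `query_model` call to a black-box chat API and a `judge` function that flags harmful completions; the text augmentations (shuffling characters within words, random capitalization, and small ASCII perturbations) follow the kinds of changes described for BoN, but the parameter values here are illustrative, not the researchers' settings.

```python
import random
import string

def augment(request: str, p: float = 0.6) -> str:
    """Apply lightweight text augmentations that keep the request readable:
    shuffle the middle characters of some words, randomly flip letter case,
    and perturb a few characters (illustrative parameters)."""
    words = []
    for word in request.split():
        chars = list(word)
        # Shuffle the middle of longer words, keeping first/last characters.
        if len(chars) > 3 and random.random() < p:
            middle = chars[1:-1]
            random.shuffle(middle)
            chars = [chars[0]] + middle + [chars[-1]]
        out = []
        for c in chars:
            # Randomly flip case, and occasionally nudge a character's ASCII code.
            if random.random() < p:
                c = c.upper() if random.random() < 0.5 else c.lower()
            if c in string.ascii_letters and random.random() < 0.05:
                c = chr(max(33, min(126, ord(c) + random.choice((-1, 1)))))
            out.append(c)
        words.append("".join(out))
    return " ".join(words)

def bon_jailbreak(request: str, query_model, judge, max_samples: int = 10_000):
    """Best-of-N loop: keep sampling augmented prompts until the judge flags
    a harmful completion or the sample budget is exhausted."""
    for n in range(1, max_samples + 1):
        prompt = augment(request)
        response = query_model(prompt)   # black-box API call
        if judge(request, response):     # e.g., a HarmBench-style classifier
            return n, prompt, response
    return None
```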
Best-of-N jailbreaking in action
The researchers tested BoN jailbreaking on leading closed models such as Claude 3.5, GPT-4o, and Gemini 1.5, as well as the open-weight Llama 3.
They ran their tests on HarmBench, a dataset used to red-team LLMs with harmful requests. They considered a jailbreak successful if it forced the model to provide information relevant to the harmful request, even if the response was not complete or comprehensive.
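Evaluation then reduces to counting how many HarmBench behaviors are broken within a given sample budget. The snippet below is a minimal sketch of that bookkeeping, assuming a `run_attack` function like the loop sketched earlier that returns `None` when no sample within the budget succeeds; the judge that decides whether a response counts as harmful is not shown.

```python
def attack_success_rate(behaviors, run_attack, budget=10_000):
    """Fraction of HarmBench behaviors for which at least one augmented
    prompt within the budget elicited relevant harmful information.
    `run_attack` is assumed to return None when no sample succeeds."""
    broken = sum(run_attack(behavior, max_samples=budget) is not None
                 for behavior in behaviors)
    return broken / len(behaviors)
```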
Their experiments show that BoN jailbreaking is an effective attack on frontier LLMs. Without augmentation, attack success rates on all models were under 1%. With 10,000 augmented samples, the researchers achieved 78% ASR on Claude 3.5 Sonnet and 50% on Gemini Pro, which is impressive given the number of safeguards built into these models. Most successful attacks, however, needed far fewer samples. For example, 53% to 71% of the jailbreaks on Claude and GPT models required sampling only 100 augmented attacks. On Gemini, which proved to be the most resilient model family, 100 samples were enough for 22% to 30% of the attacks. To put that into perspective, running an attack with 100 samples costs around $9 on GPT-4o and $13 on Claude.
Another interesting finding is that combining BoN with other jailbreaking techniques makes the ASR curve climb more steeply. Moreover, BoN remained effective against popular defense mechanisms, including circuit breakers and Gray Swan Cygnet.
BoN also works on non-text modalities. In VLMs, for example, BoN renders the text of the harmful instruction into an image, fills the background with random colored blocks, and prompts the model to follow the instructions in the image. Audio attacks deliver the request as AI-generated speech to trick the ALM into producing harmful content. BoN image and audio attacks are less successful than text attacks but are impressive nonetheless.
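For the vision modality, the augmented input is an image rather than a string. The sketch below, using Pillow, renders the request as text over a field of random colored rectangles so the instruction stays legible while the pixels vary between samples; the specific fonts, sizes, and layouts the researchers varied are not reproduced here, and all parameters are illustrative.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_image_attack(request: str, width: int = 768, height: int = 256) -> Image.Image:
    """Render the request on a background of random colored blocks."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    # Fill the background with randomly placed, randomly colored rectangles.
    for _ in range(40):
        x0, y0 = random.randint(0, width - 1), random.randint(0, height - 1)
        x1, y1 = x0 + random.randint(20, 120), y0 + random.randint(20, 80)
        color = tuple(random.randint(0, 255) for _ in range(3))
        draw.rectangle([x0, y0, x1, y1], fill=color)
    # Overlay the instruction text at a random position.
    font = ImageFont.load_default()
    x = random.randint(10, width // 4)
    y = random.randint(10, height // 2)
    draw.text((x, y), request, fill="black", font=font)
    return img
```

Each call produces a different image of the same instruction, which is then sent to the VLM with a prompt to follow the instructions shown in the image.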
Why BoN works
While increasing the model's temperature results in minor improvements, it is the variance introduced by the augmentations that accounts for most of BoN's success.
“This is empirical evidence that augmentations play a crucial role in the effectiveness of BoN, beyond mere resampling,” the researchers write. “We hypothesize that this is because they substantially increase the entropy of the effective output distribution, which improves the algorithm’s performance.”
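A back-of-the-envelope model helps make the intuition concrete. If each sampled prompt succeeded independently with some small probability p, the chance that at least one of N samples succeeds would be 1 - (1 - p)^N. The snippet below is my own simplified illustration, not the researchers' analysis: it shows how even a modest bump in per-sample success probability, of the kind that diverse augmentations can produce relative to plain resampling, compounds sharply over thousands of samples.

```python
def best_of_n_asr(p: float, n: int) -> float:
    """Probability that at least one of n independent samples succeeds,
    assuming each succeeds with probability p (a simplifying assumption)."""
    return 1 - (1 - p) ** n

# Compare a low per-sample success rate (little output diversity) with
# slightly higher ones: the gap widens dramatically as n grows.
for p in (0.0001, 0.0005, 0.002):
    print(f"p={p:.4f}:", [round(best_of_n_asr(p, n), 3) for n in (100, 1_000, 10_000)])
```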
The researchers suggest several ways the algorithm could be improved, such as rephrasing requests, adding ciphers, or using the SVG format for image attacks.
“Overall, BoN Jailbreaking is a simple, effective, and scalable jailbreaking algorithm that successfully jailbreaks all of the frontier LLMs we considered,” the researchers write. “We thus see that despite the sophistication and advanced capabilities of frontier AI systems, their properties—stochastic outputs and sensitivity to variations in their high-dimensional input spaces—can be exploited by even simple attack algorithms.”