One long sentence is all it takes to make LLMs

Security researchers from Palo Alto Networks’ Unit 42 have discovered the key to getting large language model (LLM) chatbots to ignore their guardrails, and it’s quite simple.

You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a “toxic” or otherwise verboten response the developers had hoped would be filtered out.

The researchers’ paper also offers a “logit-gap” analysis approach as a potential benchmark for how well models are protected against such attacks.

“Our research introduces a critical concept: the refusal-affirmation logit gap,” researchers Tung-Ling “Tony” Li and Hongliang Liu explained in a Unit 42 blog post. “This refers to the idea that the training process isn’t actually eliminating the potential for a harmful response – it’s just making it less likely. There remains potential for an attacker to ‘close the gap,’ and uncover a harmful response after all.”
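To make the idea concrete, here is a minimal sketch of a logit gap, assuming it is simply the score the model gives a refusal minus the score it gives an affirmative reply. The function names and every number are hypothetical illustrations, not figures or code from the Unit 42 paper, whose exact definition may differ.

```python
import math


def logit_gap(refusal_logit: float, affirmation_logit: float) -> float:
    """Assumed definition: refusal score minus affirmation score."""
    return refusal_logit - affirmation_logit


def affirmation_probability(gap: float) -> float:
    """Chance of the affirmative reply in a two-way softmax over
    {refusal, affirmation}, which reduces to sigmoid(-gap)."""
    return 1.0 / (1.0 + math.exp(gap))


# Hypothetical scores for an aligned model: refusal is strongly preferred.
aligned_gap = logit_gap(refusal_logit=6.0, affirmation_logit=1.0)

# Hypothetical total shift contributed by an adversarial prompt.
closed_gap = aligned_gap - 5.5

p_aligned = affirmation_probability(aligned_gap)  # small, but not zero
p_closed = affirmation_probability(closed_gap)    # now above one half

print(f"gap={aligned_gap:+.1f} -> P(affirm)={p_aligned:.4f}")
print(f"gap={closed_gap:+.1f} -> P(affirm)={p_closed:.4f}")
```

In this toy setup a positive gap leaves the unwanted reply unlikely rather than impossible, which is the point the researchers make. Once the gap drops below zero, the affirmative reply becomes the more probable one.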

LLMs, the technology underpinning the current AI hype wave, don’t do what they’re usually presented as doing. They have no innate understanding, they do not think or reason, and they have no way of knowing if a response they provide is truthful or, indeed, harmful. They work based on statistical continuation of token streams, and everything else is a user-facing patch on top.
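As a rough illustration of “statistical continuation,” the sketch below uses a hypothetical lookup table of scores in place of a real neural network. The table, its numbers, and the function names are all invented for this example. A real LLM conditions on the whole preceding context and usually samples from the distribution rather than always taking the top token.

```python
import math

# Hypothetical toy "model": for each current token, raw scores (logits)
# for a handful of possible next tokens. All values are invented.
TOY_LOGITS = {
    "the": {"cat": 2.0, "dog": 1.5, ".": -1.0},
    "cat": {"sat": 2.5, ".": 0.5, "the": -0.5},
    "dog": {"sat": 1.0, ".": 1.2, "the": -0.5},
    "sat": {".": 3.0, "the": 0.0, "cat": -2.0},
}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def continue_greedily(start, max_tokens=5):
    """Repeatedly append the most probable next token (greedy decoding)."""
    tokens = [start]
    while len(tokens) < max_tokens and tokens[-1] in TOY_LOGITS:
        probs = softmax(TOY_LOGITS[tokens[-1]])
        tokens.append(max(probs, key=probs.get))
    return tokens


continuation = continue_greedily("the")
print(" ".join(continuation))
```

Nothing in the loop checks whether the output is true or harmful. It only follows the scores, which is why any guardrail has to act by changing those scores.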

Guardrails that prevent an LLM from providing harmful responses – instructions on making a bomb, for example, or other content that would get the company in legal bother – are often implemented as “alignment training,” whereby a model is trained to assign strongly negative continuation scores – “logits” – to tokens that would result in an unwanted response. This turns out to be easy to bypass, though, with the researchers reporting an 80 to 100 percent success rate for “one-shot” attacks with “almost no prompt-specific tuning” against a range of popular models including Meta’s Llama, Google’s Gemma, and Qwen 2.5 and 3 in sizes up to 70 billion parameters.

The key is run-on sentences. “A practical rule of thumb emerges,” the team wrote in its research paper. “Never let the sentence end – finish the jailbreak before a full stop and the safety model has far less opportunity to re-assert itself. The greedy suffix concentrates most of its gap-closing power before the first period. Tokens that extend an unfinished clause carry mildly positive [scores]; once a sentence-endi
