LLM Jailbreak
This technique has been demonstrated in research or controlled environments.
An adversary may target the inputs or the architecture of an LLM, placing it in a state where it will freely respond to user input, bypassing any controls, restrictions, or guardrails placed on the LLM. Once successfully jailbroken, the LLM can be used in unintended ways by the adversary.
Jailbreaks are classified as either white-box or black-box depending on the level of model access they require. In white-box jailbreaks, the attacker has direct access to the model's weights or internal states and exploits them directly. Gradient and logit-based attacks use these internal signals to select prompts that make harmful answers more likely, while fine-tuning-based methods weaken safety through retraining ([Jailbreak Attacks and Defenses: A Survey]; [JailbreakZoo]). Additionally, many large language models encode refusal as a single linear direction in their internal layers; removing or suppressing this direction via an activation edit largely disables refusal while leaving normal capabilities mostly intact ([Refusal Direction]).
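At its core, the activation edit described in [Refusal Direction] is a vector projection: the component of a hidden state lying along the identified refusal direction is subtracted out. A minimal sketch of that operation on a toy vector (the function name and example values here are illustrative, not taken from any published implementation):

```python
import numpy as np

def ablate_direction(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of hidden state h along direction r.

    Computes h' = h - (h . r_hat) r_hat, i.e. directional ablation:
    the edited state has zero component along the refusal direction.
    """
    r_hat = r / np.linalg.norm(r)
    return h - np.dot(h, r_hat) * r_hat

# Toy 4-dimensional "hidden state" and "refusal direction".
h = np.array([1.0, 2.0, 3.0, 4.0])
r = np.array([0.0, 1.0, 0.0, 0.0])

h_edited = ablate_direction(h, r)
# h_edited is [1.0, 0.0, 3.0, 4.0]: the component along r is gone,
# while all other components are untouched.
```

In a real attack this projection would be applied to the residual-stream activations at every layer during generation, which requires white-box access to the model's forward pass.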
Black-box jailbreaks, on the other hand, do not require direct model access, relying instead on clever prompting and context tricks. Examples include wrapping a harmful question inside a story, role-play, or code snippet so the model "fills in" the dangerous part as part of the scenario ([Jailbreak Attacks and Defenses: A Survey]), or hiding intent through ciphers, rare languages, and other encodings so the text looks harmless to filters but remains understandable to the model ([JailbreakZoo]; [Jailbreak Attacks and Defenses: A Survey]). Black-box jailbreaks can also take a multi-turn form, in which attackers engage in dialogue that begins harmlessly but escalates towards a forbidden goal, steering the conversation history through references and hints while never explicitly stating the malicious request ([Crescendo attack]; [Echo Chamber attack]).
Jailbreak generation can be automated with fine-tuned models ([MASTERKEY]), multi-agent systems ([Jailbreak Attacks and Defenses: A Survey]), or evolutionary algorithms ([JailbreakZoo]). Aside from fine-tuning, these approaches do not require direct model access.