LLM Jailbreaks & Defenses
Automated jailbreaks, prompt attacks, and the defenses built to catch or contain them.
Explore
Browse research by topic, paper trail, and recurring adversarial ML themes.
How search, evolution, gradients, and black-box optimization discover failures in ML systems.
Security risks that emerge when AI systems can plan, call tools, write code, use memory, coordinate tasks, or act with partial autonomy.
Attacks against image classifiers, perception systems, and vision-model assumptions.
How reward functions, agents, environments, and evaluators can be gamed, exploited, or misaligned under optimization pressure.
Security risks in the AI software stack, from model dependencies and agent tools to package ecosystems, plugins, and deployment pipelines.