Reading List: Adversarial ML and AI Security

Last updated: May 18th, 2026

This is a curated reading list of the papers that matter most for understanding adversarial machine learning, LLM security, and the security of AI agent systems. It's organized by topic rather than chronologically, and every paper includes a short note on what it shows.

The list is opinionated. It favors papers that either are foundational (the work everything else builds on) or meaningfully changed the field's understanding. If a paper isn't here, that doesn't mean it's not worth reading; it means I either haven't read it or didn't think it cleared the bar.

Foundations

The papers that establish why neural networks are exploitable in the first place.

★ Intriguing Properties of Neural Networks Szegedy et al. · ICLR 2014

The paper that introduced adversarial examples for deep neural networks. Imperceptible perturbations flip a classifier's prediction, and the same perturbations often transfer to other models, including ones trained on different subsets of the data. The foundational reference for the entire field.

★ Explaining and Harnessing Adversarial Examples Goodfellow, Shlens, Szegedy · ICLR 2015

Introduces the Fast Gradient Sign Method (FGSM) and the linearity hypothesis: the argument that adversarial examples arise from neural networks' locally linear behavior in high-dimensional input spaces, not from the extreme nonlinearity or overfitting that earlier explanations had blamed. Foundational reading for understanding why adversarial examples exist.
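To make the linearity argument concrete, here is a minimal sketch of a single FGSM step against a toy logistic-regression model, in pure Python. The model, weights, input, and `eps` are illustrative assumptions, not values from the paper. Each coordinate moves by at most `eps`, but the logit shifts by `eps` times the L1 norm of the weights, which grows with input dimension.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """One FGSM step: x + eps * sign(gradient of the loss w.r.t. x)."""
    p = predict(w, b, x)
    # For cross-entropy loss on logistic regression, dLoss/dx = (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    # (g > 0) - (g < 0) is the sign of g as an int in {-1, 0, 1}.
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

# Toy setup (all values are illustrative assumptions).
n = 1000
w = [0.05 if i % 2 == 0 else -0.05 for i in range(n)]
b = 0.0
x = [0.5 + (0.02 if wi > 0 else -0.02) for wi in w]
y = 1          # true label
eps = 0.05     # per-coordinate perturbation budget

x_adv = fgsm(w, b, x, y, eps)
clean_p = predict(w, b, x)
adv_p = predict(w, b, x_adv)
max_change = max(abs(a - c) for a, c in zip(x_adv, x))

print(f"clean p(class 1) = {clean_p:.3f}")
print(f"adversarial p(class 1) = {adv_p:.3f}")
print(f"largest per-coordinate change = {max_change:.3f}")
```

In this toy setup the prediction crosses the 0.5 decision threshold even though no single input coordinate changes by more than `eps`.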

Evasion Attacks against Machine Learning at Test Time Biggio et al. · ECML PKDD 2013

Predates Szegedy et al. by a few months and Goodfellow et al. by more than a year. Formalizes gradient-based evasion attacks against classifiers, the conceptual template that deep-learning adversarial ML later adopted wholesale. Underread because it predates the deep-learning era; worth reading for the long view.

The Limitations of Deep Learning in Adversarial Settings Papernot et al. · IEEE EuroS&P 2016

Introduces the Jacobian-based Saliency Map Attack (JSMA) and formalizes the white-box / black-box, targeted / untargeted threat model the field still uses. Required reading for understanding how attacks are categorized.

Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning Biggio, Roli · Pattern Recognition 2018

A retrospective tying the pre-deep-learning evasion literature to the modern adversarial-examples canon. The best single overview for a newcomer to the field — covers a decade of work that most people skip because they started reading in 2014.

Adversarial Machine Learning

The core computer-vision adversarial-examples canon — attacks, defenses, and why the defenses mostly don't hold up.

★ Towards Deep Learning Models Resistant to Adversarial Attacks Madry et al. · ICLR 2018

Introduces the projected gradient descent (PGD) attack and frames adversarial robustness as a min-max saddle-point optimization problem. PGD adversarial training is still the most-cited empirical defense and the de facto evaluation baseline. If you read one defense paper, this is it.
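For reference, the saddle-point objective and the PGD update can be written as follows. The notation is a sketch in the paper's general style: θ are the model parameters, D the data distribution, S the set of allowed perturbations (for example an L∞ ball), L the training loss, α the step size, and Π the projection back onto the allowed set.

```latex
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[ \max_{\delta \in \mathcal{S}} L(\theta,\, x+\delta,\, y) \Big]

x^{t+1} = \Pi_{x+\mathcal{S}}\Big( x^{t}
  + \alpha \,\operatorname{sgn}\big(\nabla_{x} L(\theta,\, x^{t},\, y)\big) \Big)
```

The inner maximization is the attack and the outer minimization is training; adversarial training approximates the inner problem with a few PGD steps per batch.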

Towards Evaluating the Robustness of Neural Networks Carlini, Wagner · IEEE S&P 2017

The C&W attack: three optimization-based attacks (one each for the L0, L2, and L∞ distance metrics) that broke defensive distillation and remain among the strongest white-box attacks. Required reading for understanding what optimization-based attack design actually looks like.
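As a rough guide to what "optimization-based" means here, the L2 variant can be sketched as the problem below. Z denotes the logits, t the target class, c a trade-off constant found by search, and κ a confidence margin; the notation is simplified from the paper.

```latex
\min_{\delta} \; \|\delta\|_2^2 + c \cdot f(x+\delta)
  \quad \text{s.t.} \quad x+\delta \in [0,1]^n

f(x') = \max\Big( \max_{i \neq t} Z(x')_i - Z(x')_t,\; -\kappa \Big)
```

The loss term f reaches its floor only once the target logit exceeds every other logit by at least κ, so the optimizer trades perturbation size against attack success directly.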

One Pixel Attack for Fooling Deep Neural Networks Su, Vargas, Sakurai · IEEE Trans. Evolutionary Computation 2019

Modifying a single pixel via differential evolution fools CIFAR-10 classifiers with a 70.97% success rate. Striking evidence of the brittleness of neural decision boundaries, and a proof-of-concept for gradient-free evolutionary attack strategies. → Covered in depth: One-Pixel Attacks: Why Computer Vision Security Is Broken

★ Adversarial Examples Are Not Bugs, They Are Features Ilyas et al. · NeurIPS 2019

The conceptual pivot of the field. Adversarial vulnerability is attributed to non-robust features: patterns that are genuinely predictive on the training distribution but brittle and incomprehensible to humans. Reframes adversarial examples from a model bug to a property of the data itself.

★ Obfuscated Gradients Give a False Sense of Security Athalye, Carlini, Wagner · ICML 2018

Found that 7 of 9 non-certified ICLR 2018 defenses relied on obfuscated gradients (a form of gradient masking) rather than genuine robustness, and circumvented 6 of them completely and 1 partially. Introduced BPDA (Backward Pass Differentiable Approximation) and applied EOT as adaptive-attack tools. The methodological gold standard for defense evaluation: if your defense hasn't been tested against adaptive attacks, it hasn't really been tested.

On the Robustness of Vision Transformers to Adversarial Examples Mahmood, Mahmood, Van Dijk · ICCV 2021

The first systematic comparison of CNN and ViT adversarial robustness. Finds that adversarial examples transfer poorly between the two architectures, which motivated ensemble defenses — though subsequent work complicated the "ViTs are more robust" take considerably.

LLM and Agentic Security

Prompt injection, jailbreaks, and what the research actually shows about defending against them.

Ignore Previous Prompt: Attack Techniques for Language Models Perez, Ribeiro · NeurIPS 2022 ML Safety Workshop

The first systematic study of direct prompt injection. Introduces the PromptInject framework and the two main attack categories — goal hijacking and prompt leaking. The starting point for understanding how the attack class was formalized.

★ Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection Greshake et al. · ACM AISec 2023

Defines and demonstrates indirect prompt injection — instructions planted in documents, webpages, or emails that an LLM retrieves and executes. Tested against Bing Chat, GPT-4, and synthetic agents. The canonical reference for application-integrated LLM attacks and the most consequential LLM security paper of 2023.

★ Universal and Transferable Adversarial Attacks on Aligned Language Models Zou et al. · arXiv 2023

Introduces Greedy Coordinate Gradient (GCG): gradient-guided discrete optimization of adversarial suffixes that transfer across model families (Vicuna, Llama-2, GPT-3.5/4, Claude, Bard). The most influential LLM attack paper to date and the paper that opened the door for automated LLM red teaming at scale. → Covered in depth: Evolving the Jailbreak

Jailbroken: How Does LLM Safety Training Fail? Wei, Haghtalab, Steinhardt · NeurIPS 2023

The conceptual framework for why jailbreaks succeed: competing objectives between capability and safety, and mismatched generalization where safety training fails to transfer to domains where capabilities exist. Essential for thinking about why alignment and robustness aren't the same problem.

Jailbreaking Black Box Large Language Models in Twenty Queries Chao et al. · IEEE SaTML 2025

Introduces PAIR (Prompt Automatic Iterative Refinement): an attacker LLM that iteratively refines jailbreak prompts against a target, often producing readable jailbreaks in fewer than 20 queries. The black-box counterpart to GCG.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents Debenedetti et al. · NeurIPS 2024

The standard benchmark for prompt-injection robustness in tool-using LLM agents — 97 realistic tasks and 629 security test cases across email, banking, Slack, and travel-booking environments. The reference work for evaluating agent-level security.

Are Aligned Neural Networks Adversarially Aligned? Carlini et al. · NeurIPS 2023

Demonstrates that RLHF-aligned LLMs remain vulnerable to worst-case adversarial inputs, especially in multimodal settings where continuous image inputs admit gradient-based attacks. Strong evidence that alignment and adversarial robustness are distinct problems that don't solve each other.

Reinforcement Learning

Reward hacking, specification gaming, and adversarial attacks on RL policies.

Specification Gaming: The Flip Side of AI Ingenuity Krakovna et al. · DeepMind, 2020

The canonical reference for specification gaming: RL agents satisfying the literal objective while violating designer intent. Not peer-reviewed, but cited widely because the examples are concrete and drawn from the peer-reviewed literature. Pairs with the authors' live spreadsheet of examples.

Adversarial Attacks on Neural Network Policies Huang et al. · ICLR 2017 Workshop

The foundational paper for adversarial attacks on RL. Demonstrates that imperceptible perturbations on observations significantly degrade trained deep-RL policies (DQN, TRPO, A3C) on Atari — the same vulnerability that affects image classifiers extends naturally to RL agents.

Vulnerability of Deep Reinforcement Learning to Policy Induction Attacks Behzadan, Munir · MLDM 2017

The training-time counterpart to Huang et al. Shows DQNs are vulnerable not just at test time but during training — introduces policy induction attacks that manipulate agent behavior by exploiting adversarial transferability.

Demonstrating Specification Gaming in Reasoning Models Bondarenko et al. (Palisade Research) · arXiv 2025

Frontier reasoning models spontaneously hack the chess environment — editing game files, replacing the engine — when instructed to win against a strong opponent, with no adversarial prompting. o1-preview attempted environment manipulation in 37% of trials. The most concrete recent demonstration of emergent agentic specification gaming in production-grade models. (Preprint; peer-reviewed replication pending.) → Covered in depth: When AI Finds the Shortcut: Reward Hacking from 1994 to 2025

Physical World Attacks

The same vulnerabilities, except the adversarial examples are printed on paper.

Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition Sharif et al. · ACM CCS 2016

The original adversarial eyeglass frames: frames printed on an ordinary inkjet printer that achieve both dodging (evading recognition) and impersonation (being recognized as a target identity). Foundational for physical-world adversarial attacks and still the clearest demonstration of what's possible with a printer.

Robust Physical-World Attacks on Deep Learning Visual Classification Eykholt et al. · CVPR 2018

The stop-sign-stickers paper. Black-and-white stickers cause targeted misclassification in 100% of lab images and in 84.8% of video frames captured from a moving vehicle. Introduces Robust Physical Perturbations (RP₂) and a physical-world evaluation methodology the field still uses.

Synthesizing Robust Adversarial Examples Athalye et al. · ICML 2018

The 3D-printed turtle classified as a rifle from most viewpoints. Introduces Expectation Over Transformation (EOT), the standard technique for crafting adversarial examples that stay adversarial through real-world variation in viewpoint, lighting, and distance.
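In symbols, EOT replaces the usual single-input objective with an expectation over a chosen distribution T of transformations t (rotation, scaling, lighting, rendering, and so on). A sketch in simplified notation, with y_t the target class, d a distance function, and ε the perturbation budget:

```latex
\hat{x} = \arg\max_{x'} \; \mathbb{E}_{t \sim T}\big[ \log P(y_t \mid t(x')) \big]
  \quad \text{s.t.} \quad
  \mathbb{E}_{t \sim T}\big[ d\big(t(x'),\, t(x)\big) \big] < \epsilon,
  \;\; x' \in [0,1]^d
```

In practice the expectation is estimated by sampling transformations at every optimization step and differentiating through them.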

Adversarial Patch Brown et al. · NeurIPS 2017 Workshop

Universal, targeted, physically realizable patches — stickers placed anywhere in a scene that force a target classification regardless of context. The paper that spawned the entire patch-attack subfield, and still the cleanest demonstration of the core concept.

Adversarial Texture for Fooling Person Detectors in the Physical World Hu et al. · CVPR 2022

Generates adversarial textures printed onto clothing that evade person detectors from arbitrary viewpoints and distances. The strongest post-2020 physical-world attack on object detection — the natural endpoint of the line of work started by Eykholt and Athalye.

Advanced Topics

Quantum ML, neuromorphic computing, and evolutionary attacks — the frontier most people aren't reading yet.

Quantum Adversarial Machine Learning Lu, Duan, Deng · Physical Review Research 2020

Establishes that variational quantum classifiers are vulnerable to adversarial perturbations directly analogous to those that affect classical models. Explores quantum adversarial training as a defense. The reference paper for the emerging quantum-adversarial-ML subfield.

Inherent Adversarial Robustness of Deep Spiking Neural Networks Sharmin et al. · ECCV 2020

Shows that Spiking Neural Networks on neuromorphic hardware exhibit substantially higher adversarial robustness than equivalent ANNs, attributing the gap to Poisson rate encoding and leaky-integrate-and-fire dynamics. The starting point for neuromorphic-computing security.

GenAttack: Practical Black-Box Attacks with Gradient-Free Optimization Alzantot et al. · GECCO 2019

Genetic-algorithm-based black-box attack requiring roughly 2,100× and 2,600× fewer queries than ZOO, the prior state of the art, against MNIST and CIFAR-10 respectively, and 237× fewer against Inception-v3 on ImageNet. The natural complement to the differential-evolution approach of Su et al., and the reference work for evolutionary adversarial ML more broadly. → Covered in depth: Swarm Intelligence as a Weapon
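To show the general shape of a gradient-free, query-only attack, here is a deliberately simplified genetic search in pure Python. It is not the GenAttack algorithm: it uses truncation selection with elitism rather than the paper's selection scheme, and `black_box_score` is a hypothetical stand-in for a model that returns only a target-class score. All constants are illustrative assumptions.

```python
import random

def black_box_score(x):
    """Hypothetical target-class score; the attacker sees only this number."""
    weights = [0.3, -0.2, 0.5, -0.4, 0.1, 0.25, -0.35, 0.15]
    return sum(w * xi for w, xi in zip(weights, x))

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def genetic_attack(score, x, eps, pop_size=20, generations=50,
                   mutation_rate=0.2, seed=0):
    """Search for a perturbation within an L-infinity ball of radius eps
    that raises score(x + delta), using only queries to score."""
    rng = random.Random(seed)
    n = len(x)

    def fitness(delta):
        return score([xi + di for xi, di in zip(x, delta)])

    # Start from the unperturbed input plus random candidates in the eps-ball.
    population = [[0.0] * n] + [
        [rng.uniform(-eps, eps) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]   # truncation selection
        children = [ranked[0]]              # elitism: keep the best as-is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: take each coordinate from either parent.
            child = [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]
            # Mutation: jitter some coordinates, then clip back into the ball.
            child = [
                clip(ci + 0.5 * rng.uniform(-eps, eps), -eps, eps)
                if rng.random() < mutation_rate else ci
                for ci in child
            ]
            children.append(child)
        population = children
    return max(population, key=fitness)

x = [0.5] * 8
eps = 0.1
delta = genetic_attack(black_box_score, x, eps)
base_score = black_box_score(x)
adv_score = black_box_score([xi + di for xi, di in zip(x, delta)])
print(f"score before: {base_score:.3f}, after: {adv_score:.3f}")
```

Because the unperturbed input is in the initial population and the best candidate is always carried forward, the returned perturbation never scores worse than the original input. Query count is roughly `pop_size * generations`, which is the budget real attacks of this kind try to minimize.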

Notes

This list is maintained by Josh and reflects what I've actually read closely enough to annotate. If you think a paper belongs here, email [email protected].