Adversarial training is a method for making a model more robust by training it not just on normal examples, but also on intentionally perturbed inputs designed to fool it.

These perturbed inputs are called adversarial examples.

Core idea

Instead of only minimizing the loss on clean training data, adversarial training also tries to minimize the loss on the “worst” small perturbations of that data.

So during training:

  • start with a normal input
  • generate a modified version that makes the model more likely to fail
  • train the model on both the clean and adversarial versions

This makes the model learn decision boundaries that are less fragile.
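The loop above can be sketched with a tiny self-contained example. The following is a minimal illustration, not a production recipe: it trains a logistic-regression classifier on toy data and generates adversarial versions with a single gradient-sign step (an FGSM-style perturbation). All data, names, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data: two Gaussian blobs.
X = np.vstack([rng.normal(-1.0, 0.7, size=(100, 2)),
               rng.normal(+1.0, 0.7, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100, dtype=float)

w = np.zeros(2)
b = 0.0
lr = 0.1    # learning rate for the parameter update
eps = 0.2   # size of the adversarial perturbation

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    # Step 1: generate a modified version of each input that pushes
    # the model toward failure. For logistic loss the gradient with
    # respect to the input is (p - y) * w, so moving each input a
    # small step in the sign of that gradient increases the loss.
    p = sigmoid(X @ w + b)
    grad_x = (p - y)[:, None] * w[None, :]
    X_adv = X + eps * np.sign(grad_x)

    # Step 2: train on both the clean and the adversarial versions.
    for Xb in (X, X_adv):
        p = sigmoid(Xb @ w + b)
        grad_w = Xb.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b

# Accuracy on the clean data after robust training.
acc = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
print(f"clean accuracy: {acc:.2f}")
```

Because the perturbations are recomputed against the current model at every iteration, the "hard" examples track the model as it improves, which is what distinguishes this from fixed data augmentation.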

Why it works

Many machine learning models, especially neural networks, can be sensitive to tiny input changes that are almost invisible to humans but still cause wrong predictions.

Adversarial training exposes the model to these difficult cases during learning, so the model becomes harder to attack.

In that sense, adversarial training is similar to a targeted form of data augmentation, except the added examples are chosen specifically to be difficult for the current model.

Optimization view

Ordinary training tries to reduce prediction error on the training set.

Adversarial training is closer to a min-max problem:

  • an inner step searches for a perturbation that increases the loss
  • an outer step updates the model parameters to reduce that worst-case loss

The outer update is still usually done with methods based on Gradient Descent.
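Written as a formula (all notation introduced here: theta for the model parameters, delta for the perturbation, epsilon for the allowed perturbation size, and L for the loss), the min-max objective is commonly stated as:

```latex
\min_{\theta} \; \mathbb{E}_{(x,\,y) \sim \mathcal{D}}
  \left[ \max_{\|\delta\| \le \epsilon} L\big(f_\theta(x + \delta),\, y\big) \right]
```

The inner maximization over delta is the perturbation search, and the outer minimization over theta is the usual parameter update.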

Tradeoffs

Adversarial training can:

  • improve robustness to adversarial attacks
  • sometimes improve stability under small input shifts
  • significantly increase training cost because adversarial examples must be generated during training
  • sometimes reduce clean-data accuracy if robustness is pushed too aggressively

Important distinction

In machine learning, "adversarial training" usually refers to training a model for robustness against adversarial examples, as described above.

This is different from training one model against another in the style of a generative adversarial network, where two models are explicitly optimized against each other.