Online reinforcement learning (online RL) is reinforcement learning in which the agent improves by interacting with the environment during training.
The agent:
- observes the current state
- chooses an action using its current policy
- receives a reward and next state from the environment
- updates the policy or value function using that experience
- repeats the process with the updated policy
The key idea is that the agent’s own behavior determines what data it collects.
Training loop
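At time step $t$, the agent observes the state $s_t$.
The agent samples an action from its policy:
$$a_t \sim \pi_\theta(\cdot \mid s_t)$$
Then the environment returns a reward and the next state:
$$r_t, \; s_{t+1}$$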
The agent uses this transition to update its policy, value function, or both.
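The loop described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `ChainEnv` is a hypothetical toy environment, tabular Q-learning with an epsilon-greedy policy is only one possible choice of update rule, and all hyperparameters are arbitrary.

```python
import random


class ChainEnv:
    """Hypothetical 5-state chain: start in state 0, reward 1.0 on reaching the last state."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, action 0 moves left; moves are clipped at both ends.
        delta = 1 if action == 1 else -1
        self.state = min(max(self.state + delta, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def train_online(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, max_steps=100, seed=0):
    rng = random.Random(seed)
    env = ChainEnv()
    q = [[0.0, 0.0] for _ in range(env.n_states)]  # q[state][action]
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            # 1. Choose an action with the current (epsilon-greedy) policy.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            # 2. The environment returns a reward and the next state.
            s_next, r, done = env.step(a)
            # 3. Update the value estimate from this single transition.
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            # 4. Continue with the updated policy.
            s = s_next
            if done:
                break
    return q


q = train_online()
greedy_actions = [max(range(2), key=lambda a: q[s][a]) for s in range(4)]
print(greedy_actions)
```

Because the policy is derived from `q`, every update changes which transitions the agent collects next, which is the feedback loop that defines online RL.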
Online vs offline RL
| Type | Data source | Main issue |
|---|---|---|
| Online RL | data collected by the current training policy | exploration and unstable feedback loops |
| Offline RL | fixed dataset collected before training | distribution shift when the policy chooses actions not covered by the dataset |
Online RL can discover better behavior through trial and error, but it must safely handle bad actions during learning.
Offline RL avoids live trial-and-error, but it is limited by the quality and coverage of the dataset.
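To make the coverage problem concrete, the sketch below checks which actions a policy would choose that never appear in a fixed dataset. The helper `uncovered_pairs`, the tuple layout `(state, action, reward, next_state)`, and the tiny dataset are all hypothetical and exist only for illustration.

```python
def uncovered_pairs(dataset, policy, states):
    """Return (state, action) pairs the policy would choose that the dataset never contains."""
    seen = {(s, a) for (s, a, _r, _s_next) in dataset}
    return [(s, policy(s)) for s in states if (s, policy(s)) not in seen]


# Hypothetical logged transitions: (state, action, reward, next_state).
dataset = [(0, 1, 0.0, 1), (1, 1, 0.0, 2), (2, 0, 0.0, 1)]


def always_right(state):
    # Hypothetical learned policy that picks action 1 in every state.
    return 1


print(uncovered_pairs(dataset, always_right, states=[0, 1, 2]))  # [(2, 1)]
```

An offline learner has no data about what happens after `(2, 1)`, so its value estimate for that pair is unsupported. An online learner could simply try the action and observe the result.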
On-policy and off-policy online RL
Online RL can be either on-policy or off-policy.
On-policy methods learn from data collected by the current policy.
Examples:
- REINFORCE
- SARSA
- Proximal Policy Optimization (PPO)
Off-policy methods can learn from data collected by older policies or a replay buffer.
Examples:
- Deep Q-learning (DQN)
- Q-learning-style algorithms
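The distinction shows up directly in the update target. The sketch below contrasts a SARSA-style target (a standard on-policy rule, chosen here for illustration) with a Q-learning target, and adds a minimal replay buffer of the kind off-policy methods can sample from. The function names, the `ReplayBuffer` class, and the transition layout are assumptions made for this example.

```python
import random
from collections import deque


def sarsa_target(r, q_next, a_next, gamma, done):
    """On-policy: bootstrap from the action the current policy actually took next."""
    return r if done else r + gamma * q_next[a_next]


def q_learning_target(r, q_next, gamma, done):
    """Off-policy: bootstrap from the greedy action, regardless of what the behavior policy did."""
    return r if done else r + gamma * max(q_next)


class ReplayBuffer:
    """Fixed-capacity store of past transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Transitions may come from older policies, which is why the update must be off-policy.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add((t, 1, 0.0, t + 1))  # hypothetical (state, action, reward, next_state)
print(len(buffer.buffer))  # 3
```

The SARSA-style target depends on `a_next`, so it is only valid for data generated by the policy being learned. The Q-learning target does not, which is what makes reuse of buffered data possible.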
Why online RL is hard
Online RL is difficult because:
- rewards can be sparse or delayed
- credit assignment is hard in long-horizon tasks
- the training data distribution changes as the policy changes
- exploration can produce unsafe or low-quality actions
- small policy updates can compound into large behavior changes
- learning can be sample inefficient because many environment interactions may be needed
This is why practical online RL often uses stabilizing techniques such as replay buffers, reward shaping, value functions, advantage estimation, entropy bonuses, or conservative policy updates.
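Two of these techniques are small enough to sketch directly. The first function below computes advantages in the style of generalized advantage estimation (GAE). The second is a per-sample clipped surrogate objective of the kind used in PPO, shown here as one example of a conservative policy update. Both are simplified, framework-free illustrations, and the function names and default hyperparameters are assumptions.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates for one trajectory.

    `values` must have len(rewards) + 1 entries; the last one is the bootstrap
    value of the final state (0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Clipped objective for one sample; `ratio` is new_prob / old_prob for the taken action."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes the incentive to move the ratio far outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))  # [2.0, 1.0]
print(clipped_surrogate(1.5, 1.0))  # 1.2
```

With `lam=1.0` the estimate reduces to the discounted return minus the value baseline, and with `lam=0.0` it reduces to the one-step TD error, so `lam` trades variance against bias.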
Online RL is a better fit when interaction is cheap, fast, and safe. It is harder to apply when each action is expensive, slow, dangerous, or requires human supervision.