Online Reinforcement Learning

Online reinforcement learning is Reinforcement Learning in which the agent improves by interacting with the environment during training, rather than learning only from data gathered beforehand.

The agent:

observes the current state
chooses an action using its current Policy
receives a reward and next state from the environment
updates the policy or value function using that experience
repeats the process with the updated policy

The key idea is that the agent’s own behavior determines what data it collects.

Training loop

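At time step $t$, with the agent in state $s_t$:

The agent samples an action from its policy:

$$a_t \sim \pi(\cdot \mid s_t)$$

Then the environment returns a reward and the next state:

$$(r_t, \; s_{t+1})$$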

The agent uses this transition to update its policy, value function, or both.
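The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `ChainEnv` is a hypothetical five-state toy environment, the hyperparameters are arbitrary, and tabular Q-learning with epsilon-greedy exploration stands in as one possible update rule, since the text does not prescribe a specific one.

```python
import random

class ChainEnv:
    """Hypothetical toy environment: states 0..4, reward 1.0 on reaching state 4."""
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right (clamped to the chain).
        delta = 1 if action == 1 else -1
        self.state = min(max(self.state + delta, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def epsilon_greedy(q, state, epsilon, rng):
    # Explore with probability epsilon, otherwise act greedily (ties broken at random).
    if rng.random() < epsilon:
        return rng.randrange(2)
    best = max(q[state])
    return rng.choice([a for a in range(2) if q[state][a] == best])

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    env = ChainEnv()
    q = [[0.0, 0.0] for _ in range(env.n_states)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(q, s, epsilon, rng)   # sample a_t from the current policy
            s_next, r, done = env.step(a)            # environment returns r_t and s_{t+1}
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])    # update from this single transition
            s = s_next                               # continue with the updated policy
    return q
```

Calling `train()` returns the learned table; in this toy setup the value of moving right should end up above the value of moving left in every non-terminal state. Note that the data the agent sees depends entirely on which actions its current table makes it choose.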

Online vs offline RL

Type	Data source	Main issue
Online RL	data collected by the current training policy	exploration and unstable feedback loops
Offline RL	fixed dataset collected before training	distribution shift when the policy chooses actions not covered by the dataset

Online RL can discover better behavior through trial and error, but it must safely handle bad actions during learning.

Offline RL avoids live trial-and-error, but it is limited by the quality and coverage of the dataset.
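The coverage limit of offline RL can be illustrated with a small check. The sketch below is only an illustration under stated assumptions: the dataset, the state and action names, `learned_policy`, and `uncovered_choices` are all hypothetical, and real offline RL methods deal with coverage far less crudely than an exact set lookup.

```python
# Hypothetical logged transitions: (state, action, reward, next_state).
dataset = [
    ("s0", "left", 0.0, "s0"),
    ("s0", "right", 0.0, "s1"),
    ("s1", "right", 1.0, "s2"),
]

def learned_policy(state):
    # Hypothetical policy that happens to prefer "left" in s1.
    return {"s0": "right", "s1": "left"}[state]

def uncovered_choices(dataset, policy, states):
    """Return the (state, action) pairs the policy picks that never appear in the dataset."""
    covered = {(s, a) for (s, a, _r, _s_next) in dataset}
    return [(s, policy(s)) for s in states if (s, policy(s)) not in covered]

gaps = uncovered_choices(dataset, learned_policy, ["s0", "s1"])
# gaps == [("s1", "left")]
```

An offline learner has no data on what happens after `("s1", "left")`, so any value it assigns there is an extrapolation. An online agent could simply try the action and observe the outcome, at the cost of actually taking it.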

On-policy and off-policy online RL

Online RL can be either on-policy or off-policy.

On-policy methods learn from data collected by the current policy.

Examples:

Off-policy methods can learn from data collected by older policies or a replay buffer.

Examples:

Deep Q Learning
Q-learning-style algorithms
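One way to see the on-policy and off-policy distinction is to compare the bootstrap targets of two tabular methods. SARSA is brought in here purely as a standard on-policy counterpart to Q-learning; it is not named in the text above, and the function names and numbers are illustrative assumptions.

```python
def sarsa_target(q, reward, next_state, next_action, gamma):
    # On-policy: bootstrap from the action the current policy actually took next.
    return reward + gamma * q[next_state][next_action]

def q_learning_target(q, reward, next_state, gamma):
    # Off-policy: bootstrap from the greedy action, whatever the behavior policy did.
    return reward + gamma * max(q[next_state])

# Hypothetical Q-table with one next state and two actions.
q = {"s1": [0.2, 1.0]}

# Suppose the behavior policy explored and took action 0 in s1.
on_policy = sarsa_target(q, reward=0.5, next_state="s1", next_action=0, gamma=0.9)
off_policy = q_learning_target(q, reward=0.5, next_state="s1", gamma=0.9)
# on_policy is about 0.68, off_policy is about 1.4
```

The SARSA target depends on what the data-collecting policy did next, so stale data from an older policy would bias it. The Q-learning target does not, which is what lets Q-learning-style methods reuse older transitions.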

Why online RL is hard

Online RL is difficult because:

rewards can be sparse or delayed
credit assignment is hard in long-horizon tasks
the training data distribution changes as the policy changes
exploration can produce unsafe or low-quality actions
small policy updates can compound into large behavior changes
learning can be sample inefficient because many environment interactions may be needed

This is why practical online RL often uses stabilizing techniques such as replay buffers, reward shaping, value functions, advantage estimation, entropy bonuses, or conservative policy updates.
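As an example of one technique from that list, a replay buffer can be sketched as a bounded queue with uniform random sampling. This is a minimal sketch under stated assumptions: the class name, method names, and transition layout are hypothetical, and practical buffers often add batching into arrays or prioritized sampling.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are evicted first."""
    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the correlation
        # between consecutive transitions from the same episode.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

batch = buffer.sample(2)
# len(buffer) == 3; transitions starting from states 0 and 1 have been evicted
```

Because sampled transitions may come from older policies, a buffer like this pairs naturally with off-policy updates rather than on-policy ones.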

Online RL is a better fit when interaction is cheap, fast, and safe. It is harder to apply when each action is expensive, slow, dangerous, or requires human supervision.

Ayush Garg
