Paper Link: https://arxiv.org/abs/2509.25358
SARM stands for Stage-Aware Reward Modeling.
It is a video-based reward modeling framework for long-horizon robot manipulation, aimed especially at tasks whose demonstrations vary widely in execution speed and quality.
Main Idea
Instead of treating progress as “how far into the video we are”, SARM tries to predict:
- what stage of the task the robot is currently in
- how much progress has been made within that stage
This matters because long-horizon tasks like folding a shirt do not progress at a fixed speed. One demonstration might spend a long time on one step while another finishes that same step quickly.
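The stage-plus-offset idea can be sketched as a tiny helper. This is a minimal illustration of the concept, not the paper's actual formulation; the function name and the uniform stage weighting are assumptions.

```python
def combined_progress(stage_idx: int, within_stage: float, num_stages: int) -> float:
    """Map (stage index, within-stage fraction) to an overall progress value in [0, 1].

    Each stage is given equal width here (an assumption for illustration).
    """
    assert 0.0 <= within_stage <= 1.0
    return (stage_idx + within_stage) / num_stages

# Two demos that reach the same semantic point get the same progress value,
# no matter how many frames each one took to get there.
# e.g. stage 2 of 4, halfway through that stage:
# combined_progress(2, 0.5, 4) -> 0.625
```

The point of the decomposition is that the coarse term (which stage) is robust across demonstrations, while the fine term (within-stage fraction) still gives a dense signal.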
Why not use frame index as progress
Frame-index labeling assumes that a robot 60% of the way through a video is 60% of the way through the task.
That breaks down when:
- demonstrations have different lengths
- the robot hesitates or makes corrections
- some trajectories contain low-quality behavior
- tasks have natural stages with very different durations
SARM replaces that with semantically meaningful progress signals.
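The misalignment is easy to see with two made-up demonstrations: both reach the same semantic point in the task, but at very different fractions of their videos. The frame counts below are hypothetical.

```python
def frame_index_progress(frame: int, total_frames: int) -> float:
    """Naive progress label: fraction of the video elapsed."""
    return frame / total_frames

# Hypothetical: a fast demo reaches "fold left side" at frame 30 of 100,
# a hesitant demo reaches the same point at frame 90 of 120.
fast_demo = frame_index_progress(30, 100)   # 0.30
slow_demo = frame_index_progress(90, 120)   # 0.75
# Same semantic state of the task, wildly different "progress" labels.
```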
How labels are made
SARM uses natural-language subtask annotations to define stages of the task.
For example, a long-horizon task might be broken into subtasks like:
- pick up shirt
- align sleeves
- fold left side
- fold right side
Those stage labels are more consistent across demonstrations than raw frame position.
The key point is that the labels are semantic, not purely temporal.
So if two demonstrations both reach the “fold right side” step at very different times, SARM can still align them to the same stage.
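One way to picture the labeling step: given annotated stage-boundary frames for a demonstration, assign each frame a stage id. This is a sketch under assumed inputs (boundary frames per demo); the paper's actual annotation pipeline may differ.

```python
# Hypothetical stage list for the shirt-folding example above.
STAGES = ["pick up shirt", "align sleeves", "fold left side", "fold right side"]

def stage_labels(boundaries: list[int], total_frames: int) -> list[int]:
    """Assign a stage id to every frame.

    boundaries[i] is the first frame of stage i + 1 (strictly increasing,
    at least one frame per stage — an assumption of this sketch).
    """
    labels = []
    stage = 0
    for frame in range(total_frames):
        if stage < len(boundaries) and frame >= boundaries[stage]:
            stage += 1
        labels.append(stage)
    return labels

# A 60-frame demo with stage boundaries at frames 10, 25, 40:
# frames 0-9 -> stage 0, 10-24 -> stage 1, 25-39 -> stage 2, 40-59 -> stage 3.
```

Because the labels come from the annotated boundaries rather than the frame index, a demo that lingers in one stage simply gets more frames with that stage's label.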
What the model predicts
The reward model jointly predicts:
- task stage
- fine-grained progress inside the stage
So instead of only saying “the robot is halfway done”, it can say something closer to “the robot is in the sleeve alignment stage and making progress within that stage.”
You can think of it as a structured progress estimator:
- stage prediction gives coarse task understanding
- within-stage progress gives a smoother dense signal
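The two-headed output can be sketched as a small data structure: a distribution over stages from the classifier head, plus a scalar from the within-stage regressor. The names and the argmax-plus-offset readout are illustrative assumptions, not the paper's exact architecture.

```python
from dataclasses import dataclass

@dataclass
class ProgressEstimate:
    stage_probs: list[float]   # classifier head: distribution over task stages
    within_stage: float        # regressor head: progress inside the stage, in [0, 1]

    def overall_progress(self) -> float:
        """Coarse stage (argmax) plus the fine-grained within-stage offset."""
        stage = max(range(len(self.stage_probs)), key=lambda i: self.stage_probs[i])
        return (stage + self.within_stage) / len(self.stage_probs)

# "In the sleeve-alignment stage (stage 1 of 4), 40% through it":
est = ProgressEstimate(stage_probs=[0.05, 0.85, 0.05, 0.05], within_stage=0.4)
# est.overall_progress() -> (1 + 0.4) / 4 == 0.35
```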
Why this helps policy training
Once SARM can estimate progress from video, that reward signal can be used to train a policy more intelligently.
One use is Reward-Aligned Behavior Cloning (RA-BC), which reweights demonstration data using the estimated progress:
- action chunks with stronger progress get larger weights
- weak or noisy parts of demonstrations matter less
- the policy focuses more on behavior that actually advances the task
The same reward model can also be used as a dense reward signal for reinforcement learning.
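The reweighting idea can be sketched as follows: score each action chunk by the progress gain the reward model assigns across it. The chunking and the clipped-delta weighting here are illustrative assumptions, not the paper's exact RA-BC scheme.

```python
def chunk_weights(progress: list[float], chunk: int = 4) -> list[float]:
    """Weight each action chunk by its (non-negative) progress delta.

    `progress` is the per-frame progress estimated by the reward model.
    """
    weights = []
    for start in range(0, len(progress) - 1, chunk):
        end = min(start + chunk, len(progress) - 1)
        weights.append(max(progress[end] - progress[start], 0.0))
    return weights

# A flat stretch of estimated progress (hesitation, a failed grasp)
# yields a near-zero weight, so the BC loss mostly ignores it:
w = chunk_weights([0.0, 0.1, 0.2, 0.2, 0.2, 0.2, 0.3, 0.5, 0.8])
# first chunk advances 0.2, second chunk advances 0.6
```

These weights would then scale the per-chunk imitation loss, so the policy learns more from the parts of the demonstration that actually move the task forward.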
Why this is better than vanilla BC
Vanilla behavior cloning only asks:
- what action did the demonstrator take here?
SARM adds another question:
- was this part of the trajectory actually useful for finishing the task?
That matters a lot in long-horizon manipulation because demonstrations often include:
- pauses
- recovery motions
- failed grasps
- inefficient detours
With SARM, the policy does not have to imitate all of that equally.
Results
In the paper, SARM + RA-BC greatly outperformed vanilla behavior cloning on T-shirt folding:
- 83% success from the flattened state
- 67% success from the crumpled state
- vanilla BC only achieved 8% and 0% on those same settings
Intuition
SARM is basically a way to turn messy demonstration videos into a cleaner notion of task progress.
That is useful for long-horizon manipulation because the biggest problem is often not lack of data, but inconsistent data quality.