Paper Link: https://arxiv.org/abs/2509.25358
SARM stands for Stage-Aware Reward Modeling.
It is a video-based reward modeling framework for long-horizon robot manipulation, aimed especially at tasks whose demonstrations vary widely in execution speed and quality.
Main Idea
Instead of treating progress as “how far into the video we are”, SARM tries to predict:
- what stage of the task the robot is currently in
- how much progress has been made within that stage
This matters because long-horizon tasks like folding a shirt do not progress at a fixed speed. One demonstration might spend a long time on one step while another finishes that same step quickly.
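The stage-plus-offset idea can be sketched as a tiny helper. This is a minimal illustration of the concept, not the paper's actual formulation; the function name and the uniform stage weighting are assumptions.

```python
def combined_progress(stage_idx: int, within_stage: float, num_stages: int) -> float:
    """Map (stage index, within-stage fraction) to an overall progress value in [0, 1].

    Each stage is given equal width here (an assumption for illustration).
    """
    assert 0.0 <= within_stage <= 1.0
    return (stage_idx + within_stage) / num_stages

# Two demos that reach the same semantic point get the same progress value,
# no matter how many frames each one took to get there.
# e.g. stage 2 of 4, halfway through that stage:
# combined_progress(2, 0.5, 4) -> 0.625
```

The point of the decomposition is that the coarse term (which stage) is robust across demonstrations, while the fine term (within-stage fraction) still gives a dense signal.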
Why not use frame index as progress
Frame-index labeling assumes that a robot 60% of the way through a video is 60% of the way through the task.
That breaks down when:
- demonstrations have different lengths
- the robot hesitates or makes corrections
- some trajectories contain low-quality behavior
- tasks have natural stages with very different durations
SARM replaces that with semantically meaningful progress signals.
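The misalignment is easy to see with two made-up demonstrations: both reach the same semantic point in the task, but at very different fractions of their videos. The frame counts below are hypothetical.

```python
def frame_index_progress(frame: int, total_frames: int) -> float:
    """Naive progress label: fraction of the video elapsed."""
    return frame / total_frames

# Hypothetical: a fast demo reaches "fold left side" at frame 30 of 100,
# a hesitant demo reaches the same point at frame 90 of 120.
fast_demo = frame_index_progress(30, 100)   # 0.30
slow_demo = frame_index_progress(90, 120)   # 0.75
# Same semantic state of the task, wildly different "progress" labels.
```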
How labels are made
SARM uses natural-language subtask annotations to define stages of the task.
For example, a long-horizon task might be broken into subtasks like:
- pick up shirt
- align sleeves
- fold left side
- fold right side
Those stage labels are more consistent across demonstrations than raw frame position.
The key point is that the labels are semantic, not purely temporal.
So if two demonstrations both reach the “fold right side” step at very different times, SARM can still align them to the same stage.
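One way to picture the labeling step: given annotated stage-boundary frames for a demonstration, assign each frame a stage id. This is a sketch under assumed inputs (boundary frames per demo); the paper's actual annotation pipeline may differ.

```python
# Hypothetical stage list for the shirt-folding example above.
STAGES = ["pick up shirt", "align sleeves", "fold left side", "fold right side"]

def stage_labels(boundaries: list[int], total_frames: int) -> list[int]:
    """Assign a stage id to every frame.

    boundaries[i] is the first frame of stage i + 1 (strictly increasing,
    at least one frame per stage — an assumption of this sketch).
    """
    labels = []
    stage = 0
    for frame in range(total_frames):
        if stage < len(boundaries) and frame >= boundaries[stage]:
            stage += 1
        labels.append(stage)
    return labels

# A 60-frame demo with stage boundaries at frames 10, 25, 40:
# frames 0-9 -> stage 0, 10-24 -> stage 1, 25-39 -> stage 2, 40-59 -> stage 3.
```

Because the labels come from the annotated boundaries rather than the frame index, a demo that lingers in one stage simply gets more frames with that stage's label.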
What the model predicts
The reward model jointly predicts:
- task stage
- fine-grained progress inside the stage
So instead of only saying “the robot is halfway done”, it can say something closer to “the robot is in the sleeve alignment stage and making progress within that stage.”
You can think of it as a structured progress estimator:
- stage prediction gives coarse task understanding
- within-stage progress gives a smoother dense signal
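The two-headed output can be sketched as a small data structure: a distribution over stages from the classifier head, plus a scalar from the within-stage regressor. The names and the argmax-plus-offset readout are illustrative assumptions, not the paper's exact architecture.

```python
from dataclasses import dataclass

@dataclass
class ProgressEstimate:
    stage_probs: list[float]   # classifier head: distribution over task stages
    within_stage: float        # regressor head: progress inside the stage, in [0, 1]

    def overall_progress(self) -> float:
        """Coarse stage (argmax) plus the fine-grained within-stage offset."""
        stage = max(range(len(self.stage_probs)), key=lambda i: self.stage_probs[i])
        return (stage + self.within_stage) / len(self.stage_probs)

# "In the sleeve-alignment stage (stage 1 of 4), 40% through it":
est = ProgressEstimate(stage_probs=[0.05, 0.85, 0.05, 0.05], within_stage=0.4)
# est.overall_progress() -> (1 + 0.4) / 4 == 0.35
```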
Why this helps policy training
Once SARM can estimate progress from video, that reward signal can be used to train a policy more intelligently.
One use is Reward-Aligned Behavior Cloning (RA-BC), which reweights demonstration data using the estimated progress:
- action chunks with stronger progress get larger weights
- weak or noisy parts of demonstrations matter less
- the policy focuses more on behavior that actually advances the task
The same reward model can also be used as a dense reward signal for reinforcement learning.
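The reweighting idea can be sketched as follows: score each action chunk by the progress gain the reward model assigns across it. The chunking and the clipped-delta weighting here are illustrative assumptions, not the paper's exact RA-BC scheme.

```python
def chunk_weights(progress: list[float], chunk: int = 4) -> list[float]:
    """Weight each action chunk by its (non-negative) progress delta.

    `progress` is the per-frame progress estimated by the reward model.
    """
    weights = []
    for start in range(0, len(progress) - 1, chunk):
        end = min(start + chunk, len(progress) - 1)
        weights.append(max(progress[end] - progress[start], 0.0))
    return weights

# A flat stretch of estimated progress (hesitation, a failed grasp)
# yields a near-zero weight, so the BC loss mostly ignores it:
w = chunk_weights([0.0, 0.1, 0.2, 0.2, 0.2, 0.2, 0.3, 0.5, 0.8])
# first chunk advances 0.2, second chunk advances 0.6
```

These weights would then scale the per-chunk imitation loss, so the policy learns more from the parts of the demonstration that actually move the task forward.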
Why this is better than vanilla BC
Vanilla behavior cloning only asks:
- what action did the demonstrator take here?
SARM adds another question:
- was this part of the trajectory actually useful for finishing the task?
That matters a lot in long-horizon manipulation because demonstrations often include:
- pauses
- recovery motions
- failed grasps
- inefficient detours
With SARM, the policy does not have to imitate all of that equally.
Results
In the paper, SARM + RA-BC greatly outperformed vanilla behavior cloning on T-shirt folding:
- 83% success from the flattened state
- 67% success from the crumpled state
- vanilla BC only achieved 8% and 0% on those same settings
Intuition
SARM is basically a way to turn messy demonstration videos into a cleaner notion of task progress.
That is useful for long-horizon manipulation because the biggest problem is often not lack of data, but inconsistent data quality.