RA-BC stands for Reward-Aligned Behavior Cloning.
Standard behavior cloning treats every training example equally; RA-BC instead assigns each example a weight based on how good its action (or action chunk) is according to a reward model.
One way to write the objective is:

L(θ) = − Σ_i w_i · log π_θ(a_i | s_i)

- w_i is the RA-BC weight for example i
- a larger w_i means that example has more influence during training
- examples with better predicted reward get larger w_i
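The weighted objective above can be sketched in a few lines. This is a minimal illustration, not a reference implementation; the normalization of the weights is a common but hypothetical choice, since the text does not specify one.

```python
import numpy as np

def ra_bc_loss(log_probs, weights):
    """Reward-aligned BC loss: weighted negative log-likelihood.

    log_probs: log pi_theta(a_i | s_i) for each demonstrated action, shape (N,)
    weights:   RA-BC weights w_i >= 0, shape (N,)
    """
    log_probs = np.asarray(log_probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Normalize weights so the loss scale stays comparable to plain BC
    # (an assumed convention, not taken from the source text).
    w = weights / weights.sum()
    # Weighted negative log-likelihood: L = -sum_i w_i * log pi(a_i | s_i)
    return -np.sum(w * log_probs)
```

With uniform weights this reduces to ordinary behavior cloning's mean negative log-likelihood; skewing the weights toward high-reward examples pulls the loss toward fitting those examples first.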
Intuition
- good demonstrations should matter more than weak demonstrations
- the policy should copy actions that actually move the task forward
- this helps when the dataset has mixed-quality trajectories
In SARM
The reward model predicts stage-aware rewards for long-horizon robot manipulation. Those predicted rewards are then turned into RA-BC weights, so the policy focuses more on action chunks that better align with task progress.
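One simple way to turn predicted rewards into weights is a softmax-style exponential with a temperature. This is a plausible sketch, not SARM's actual formula, which the text does not give; the `temperature` parameter is a hypothetical knob controlling how sharply the weighting favors high-reward chunks.

```python
import numpy as np

def rewards_to_weights(rewards, temperature=1.0):
    """Map predicted chunk rewards to RA-BC weights.

    Uses an exponential (softmax-style) transform: higher predicted
    reward -> larger weight. Lower temperature -> sharper weighting.
    This scheme is an assumption for illustration, not SARM's exact rule.
    """
    r = np.asarray(rewards, dtype=float)
    # Subtract the max reward for numerical stability before exponentiating.
    w = np.exp((r - r.max()) / temperature)
    # Normalize so the weights sum to 1 across the batch.
    return w / w.sum()
```

In the limit of high temperature this approaches uniform weights (plain behavior cloning); as temperature drops, training concentrates on the action chunks the reward model rates as making the most task progress.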