For each stage, SARM looks across the whole training dataset and estimates what fraction of an episode’s total duration that stage takes on average.

That average fraction is

This is a global statistic computed from many demonstrations of the same task.

For example:

  • “Reach Shirt” usually takes
  • “Grasp Shirt” usually takes
  • “Fold Shirt” usually takes
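
As a rough sketch of how such per-stage averages could be computed, the snippet below assumes the statistic is the mean of per-episode fractions (each stage's duration divided by its episode's total length). The durations, the `episodes` layout, and the function name `average_stage_fractions` are all hypothetical, chosen only for illustration; they are not values or names from SARM itself.

```python
from statistics import mean

# Stage names taken from the example above.
STAGES = ["Reach Shirt", "Grasp Shirt", "Fold Shirt"]

# HYPOTHETICAL data: one row per demonstration, one entry per stage,
# each entry being the number of frames spent in that stage.
episodes = [
    [30, 20, 50],
    [20, 20, 60],
    [40, 20, 140],
]


def average_stage_fractions(episodes):
    """Return, for each stage, the mean fraction of an episode it occupies.

    Assumption: the global statistic is the average of per-episode
    fractions, so long and short episodes count equally.
    """
    n_stages = len(episodes[0])
    # Convert every episode's durations into fractions of its own length.
    per_episode = [[d / sum(ep) for d in ep] for ep in episodes]
    # Average each stage's fraction over all episodes.
    return [mean(fr[k] for fr in per_episode) for k in range(n_stages)]


fractions = average_stage_fractions(episodes)
for name, frac in zip(STAGES, fractions):
    print(f"{name}: {frac:.3f}")
```

Because each episode's fractions sum to 1, the averaged fractions also sum to 1, which is what lets them partition the overall progress scale.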

Then, for a frame in stage 2 (“Fold Shirt”) with the total progress is:

The normalized progress is 0.61 on a 0-to-1 scale, and that is the point of the conversion.
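
The conversion can be sketched as follows, under the assumption that it adds up the average fractions of all earlier stages and then adds the current stage's fraction scaled by how far the frame is through that stage. Stages are zero-indexed here, so that stage 2 is “Fold Shirt”. The function name `normalized_progress`, the `within_stage` parameter, and the fractions are hypothetical and do not reproduce the numbers in the example above.

```python
def normalized_progress(stage_index, within_stage, fractions):
    """Map (stage, progress inside that stage) to a single 0-to-1 value.

    stage_index  : zero-based index of the current stage
    within_stage : how far through the current stage the frame is, in [0, 1]
    fractions    : average fraction of an episode taken by each stage
    """
    if not 0.0 <= within_stage <= 1.0:
        raise ValueError("within_stage must lie in [0, 1]")
    # Share of the episode covered by the stages already completed.
    completed = sum(fractions[:stage_index])
    # Add the portion of the current stage finished so far.
    return completed + fractions[stage_index] * within_stage


# HYPOTHETICAL average fractions for Reach, Grasp, Fold.
fractions = [0.25, 0.25, 0.5]

# A frame halfway through stage 2: 0.25 + 0.25 + 0.5 * 0.5
p = normalized_progress(2, 0.5, fractions)
print(p)  # 0.75
```

With this form the value starts at 0 at the beginning of the first stage, reaches 1 at the end of the last one, and never decreases as the stage index or the within-stage progress increases.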