Blog Link: https://huggingface.co/spaces/lerobot/robot-folding

Tips for good data collection

  1. Practice before you record. Consistent, deliberate demonstrations are more valuable than hesitant or inconsistent ones.
  2. Quality over speed. High-quality task execution is more valuable than fast, sloppy execution.
  3. Be consistent within episodes. The model learns a coherent strategy more easily than it learns movements that vary wildly each time.
  4. Start small, then extend. Train a quick model, see what fails, then add diversity. Don’t try to collect the perfect dataset on day one.
  5. Speed after quality. Once you’ve dialed in quality and a consistent strategy, optimize for speed, but never sacrifice quality for it.
  6. Watch your setup, not just your data. If the rig vibrates or frustrates operators, fix that before collecting more.

Architecture

Real-Time Chunking (RTC)

Training Recipe

Training ran on 8xH100 GPUs with a per-GPU batch size of 32 and gradient accumulation, using AdamW with a warmup + cosine-decay learning-rate schedule

A large batch size is important for stable VLA training, which drives the multi-GPU requirement
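The schedule shape (linear warmup, then cosine decay) can be sketched in plain Python. The base learning rate and warmup length below are assumptions for illustration; the post only specifies the schedule shape and the 200k training steps:

```python
import math

# BASE_LR and WARMUP_STEPS are illustrative assumptions, not from the post.
BASE_LR = 1e-4
WARMUP_STEPS = 1_000
TOTAL_STEPS = 200_000  # matches the 200k training steps in the post

def learning_rate(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a training loop this function would feed the optimizer's learning rate each step (e.g. via a PyTorch `LambdaLR`), with gradient accumulation enlarging the effective batch beyond the 32 samples per GPU.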

Evaluation

Evals are very important: if your evals aren’t good, every decision you make based on them will be wrong

Metrics

Metrics they gave for folding clothes:

  1. Success Rate: binary pass/fail per rollout
  2. Score: partial credit based on subtasks completed
  3. Fold Quality: a 1-5 rating of the final fold appearance, averaged across successful rollouts
  4. Completion Time: seconds to complete Level 1/Level 2, averaged across successful rollouts
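The four metrics above are easy to compute from per-rollout records. A minimal sketch, where the `Rollout` fields and the subtask-fraction definition of "score" are illustrative assumptions rather than the post's exact bookkeeping:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Rollout:
    success: bool                 # binary pass/fail
    subtasks_done: int            # subtasks completed in this rollout
    subtasks_total: int           # subtasks in the full task
    fold_quality: Optional[int]   # 1-5 rating, only for successful rollouts
    seconds: Optional[float]      # completion time, only if successful

def summarize(rollouts: List[Rollout]) -> dict:
    n = len(rollouts)
    ok = [r for r in rollouts if r.success]
    return {
        "success_rate": len(ok) / n,
        "score": sum(r.subtasks_done / r.subtasks_total for r in rollouts) / n,
        "fold_quality": sum(r.fold_quality for r in ok) / len(ok) if ok else None,
        "completion_time": sum(r.seconds for r in ok) / len(ok) if ok else None,
    }
```

Averaging fold quality and completion time only over successful rollouts, as the post describes, keeps those two metrics from being distorted by failures.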

Data to Deployment

They started simply by training π0 and π0.5 on the full dataset (5,688 episodes) for 200k steps each (~27 hours on 8xH100s)

The models could sometimes fold a laid-out shirt, but they were slow and produced poor-quality folds. They suspected the problem was in the data: different operators used different grip points and strategies for unspreading the shirt

Improving the data

  1. Removed demonstrations that didn’t end with a properly folded shirt: if the end result isn’t good, the demonstration isn’t useful
  2. Length-based filtering using the LeRobot data visualizer to remove outliers, since short episodes tend to be low quality
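One way to automate the length-based filtering is to flag episodes whose length deviates sharply from the median. This is a sketch, not the post's actual procedure: the median-absolute-deviation rule and the threshold `k` are assumptions, and in practice the outliers were inspected with the LeRobot data visualizer.

```python
import statistics
from typing import List

def flag_length_outliers(lengths: List[int], k: float = 2.0) -> List[int]:
    """Return indices of episodes whose length is more than k
    median-absolute-deviations away from the median length."""
    med = statistics.median(lengths)
    mad = statistics.median([abs(x - med) for x in lengths]) or 1.0
    return [i for i, x in enumerate(lengths) if abs(x - med) / mad > k]
```

A robust statistic like the median is preferable to the mean here, since the outliers being hunted would themselves skew a mean-based threshold.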

They trained a SARM to answer a hard question: how do you measure “progress” in a long, multi-stage task like t-shirt folding?

They annotated every episode in both datasets using SARM, which gave them continuous per-timestep quality scores they could use in two ways: for data curation and for reward-weighted training
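The reward-weighted training idea can be sketched as turning per-timestep SARM scores into per-sample loss weights. The exponential weighting form and the temperature `beta` below are assumptions for illustration; the post only says the scores were used to weight training.

```python
import math
from typing import List

def reward_weights(scores: List[float], beta: float = 1.0) -> List[float]:
    """Map per-timestep quality scores (assumed in [0, 1]) to loss
    weights with mean 1, so the overall loss scale is unchanged."""
    raw = [math.exp(beta * s) for s in scores]
    total = sum(raw)
    return [w * len(raw) / total for w in raw]
```

Normalizing the weights to mean 1 keeps the effective learning rate stable, while still up-weighting high-quality timesteps relative to low-quality ones. For the curation use, the same scores could simply be thresholded to drop low-scoring episodes.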