Blog Link: https://huggingface.co/spaces/lerobot/robot-folding
Tips for good data collection
- Practice before you record. Consistent, deliberate demonstrations are more valuable than hesitant or inconsistent ones
- Quality over speed. High-quality task execution is more valuable than fast, sloppy execution
- Be consistent within episodes. A coherent strategy is easier for the model to learn than movements that vary wildly each time
- Start small, then extend. Train a quick model, see what fails, then add diversity. Don’t try to collect the perfect dataset on day one
- Speed after quality. Once you’ve dialed in the quality and a consistent strategy, optimize for speed. But never sacrifice quality for it
- Watch your setup, not just your data. If the rig vibrates or frustrates operators, fix that before collecting more
Architecture
Real-Time Chunking (RTC)
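A toy sketch of the real-time chunking idea: execute actions from the current chunk while the next chunk is generated, keeping a committed prefix of already-promised actions so the new chunk continues them smoothly. The class name, chunk size, and freeze horizon are made-up illustration values, not from the post.

```python
from collections import deque

class RTCExecutor:
    """Toy sketch of real-time chunking: pop actions from the current chunk,
    and when the queue runs low, request a new chunk conditioned on the
    still-committed tail of the old one. Values here are illustrative."""

    def __init__(self, policy, chunk_size=8, freeze=2):
        self.policy = policy        # policy(obs, committed) -> list of actions
        self.chunk_size = chunk_size
        self.freeze = freeze        # actions we promise to execute unchanged
        self.queue = deque()

    def step(self, obs):
        if len(self.queue) <= self.freeze:
            committed = list(self.queue)
            new_chunk = self.policy(obs, committed)
            # keep the committed prefix, append the remainder of the new chunk
            self.queue = deque(committed + new_chunk[len(committed):])
        return self.queue.popleft()
```

The point of the committed prefix is to avoid discontinuities at chunk boundaries while inference for the next chunk runs in the background.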
Training Recipe
Training ran on 8xH100 GPUs with a per-GPU batch size of 32, using gradient accumulation and AdamW with a learning rate of
A large batch size is important for stable VLA training and drives the multi-GPU requirement
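A minimal sketch of the effective-batch-size arithmetic behind the recipe. The 8 GPUs and per-GPU batch size of 32 come from the post; the accumulation step count is a hypothetical stand-in, since the post doesn't give it.

```python
# Known from the post: 8 GPUs, per-GPU batch size 32, gradient accumulation.
num_gpus = 8
per_gpu_batch = 32
accum_steps = 4  # hypothetical: not stated in the post

# Effective batch = GPUs x per-GPU batch x accumulation steps
effective_batch = num_gpus * per_gpu_batch * accum_steps  # 1024 with these values

def accumulate(grads_per_microbatch):
    """Average micro-batch gradients before one optimizer step, which is
    numerically equivalent to a single large-batch gradient step."""
    total = 0.0
    for g in grads_per_microbatch:
        total += g
    return total / len(grads_per_microbatch)
```

Accumulation is what lets a fixed per-GPU memory budget still produce the large effective batch the training needs.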
Evaluation
Evals are critical: if your evals aren't good, every decision you make based on them will be wrong
Metrics
Metrics they gave for folding clothes:
- Success Rate: binary pass/fail per rollout
- Score: partial credit based on subtasks completed
- Fold Quality: a 1-5 rating of the final fold appearance, averaged across successful rollouts
- Completion Time: seconds to complete Level 1/Level 2, averaged across successful rollouts
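The four metrics above can be computed from a list of rollout records, sketched below. The field names (`success`, `subtasks_done`, etc.) are assumptions for illustration; the post only names the metrics.

```python
def summarize(rollouts):
    """Compute the four folding metrics from rollout dicts (field names assumed)."""
    successes = [r for r in rollouts if r["success"]]
    n = len(rollouts)
    return {
        # Success Rate: binary pass/fail per rollout
        "success_rate": len(successes) / n,
        # Score: partial credit from subtasks completed
        "score": sum(r["subtasks_done"] / r["subtasks_total"] for r in rollouts) / n,
        # Fold Quality: 1-5 rating, averaged over successful rollouts only
        "fold_quality": (sum(r["fold_quality"] for r in successes) / len(successes)
                         if successes else None),
        # Completion Time: seconds, averaged over successful rollouts only
        "completion_time": (sum(r["seconds"] for r in successes) / len(successes)
                            if successes else None),
    }
```

Note that fold quality and completion time are conditioned on success, so they can look good even when the success rate is low; read them together.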
Data to Deployment
They started simply by training pi0 and pi0.5 on the full dataset (5,688 episodes) for 200k steps each (~27 hours on 8xH100s)
The models could sometimes fold a laid-out shirt, but they were slow and produced poor-quality folds. They suspected the problem was in the data: different operators used different grip points and strategies for unspreading the shirt
Improving the data
- Removed demonstrations that didn't end with a properly folded shirt; if the end result isn't good, the demonstration isn't useful
- Length-based filtering using the LeRobot data visualizer to remove outliers; short episodes tend to be low quality
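The length-based filter can be sketched as a simple band on frame counts. The post says they used the LeRobot data visualizer to spot outliers; the thresholds below are hypothetical stand-ins.

```python
def filter_by_length(episode_lengths, min_frames=200, max_frames=2000):
    """Keep indices of episodes whose frame count falls inside the band.
    Thresholds are illustrative: very short episodes tend to be low quality,
    and very long ones are outliers."""
    return [i for i, n in enumerate(episode_lengths)
            if min_frames <= n <= max_frames]
```

In practice you would eyeball the length histogram first and set the band from it rather than hard-coding thresholds.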
They trained a SARM to answer a hard question: how do you measure "progress" in a long, multi-stage task like t-shirt folding?
They annotated every episode in both datasets using the SARM, giving them continuous per-timestep quality scores they could use in two ways: for data curation and for reward-weighted training
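The reward-weighted training use of those scores can be sketched as weighting each timestep's imitation loss by its SARM score. This is a pure-Python toy of the weighting itself, not the actual training loop; the function name and shapes are assumptions.

```python
def weighted_loss(per_step_losses, sarm_scores):
    """Weighted mean of per-timestep losses, with per-timestep SARM quality
    scores as weights: high-quality timesteps contribute more to the update."""
    total_w = sum(sarm_scores)
    return sum(l * w for l, w in zip(per_step_losses, sarm_scores)) / total_w
```

With uniform scores this reduces to ordinary behavior cloning; skewed scores down-weight the sloppy portions of an episode without discarding it entirely.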