Paper Link: https://www.pi.website/download/pi05.pdf
Vision Language Action (VLA) model
VLA total params = Vision Language Model (VLM) params + action head params
VLA models are initialized with the weights of a pre-trained VLM
Training
The pi 0.5 paper goes into great detail about the training mixtures used to train the model, which help improve its generalizability; I am not going to cover them in detail here
VLAs are trained via imitation learning on a robot demonstration dataset D, by maximizing the log-likelihood of an action chunk given an observation and a natural-language task
The observation typically contains one or more images
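A minimal toy sketch of this objective (not the paper's code: the tabular policy and the observation/task names are invented). Maximizing the log-likelihood of expert actions is equivalent to minimizing the average negative log-likelihood (NLL) over D:

```python
import math

# Hypothetical tabular "policy": pi[(observation, task)][action] = probability.
# A real VLA replaces this table with a neural network over image/text tokens.
pi = {
    ("cup_on_table", "clean up"): {"grasp_cup": 0.7, "wipe_table": 0.2, "noop": 0.1},
    ("dirty_table", "clean up"): {"grasp_cup": 0.1, "wipe_table": 0.8, "noop": 0.1},
}

# Demonstration dataset D: (observation, task, expert_action) triples.
D = [
    ("cup_on_table", "clean up", "grasp_cup"),
    ("dirty_table", "clean up", "wipe_table"),
]

def nll(policy, dataset):
    """Average negative log-likelihood of the expert actions under the policy."""
    total = 0.0
    for obs, task, action in dataset:
        total -= math.log(policy[(obs, task)][action])
    return total / len(dataset)

loss = nll(pi, D)  # imitation learning drives this quantity down
```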
pi 0.5 is trained in two stages:
- a pre-training stage that adapts the model to diverse robotic tasks
- a post-training stage that specializes it to mobile manipulation
The action expert is added during post-training
Architecture
The architecture can represent both action chunk distributions and tokenized text outputs
For subtask inference it’s written as

$$\pi_\theta(a_{t:t+H}, \hat{\ell}_t \mid o_t, \ell)$$

- $\ell$ — overall task prompt
- $\hat{\ell}_t$ — high-level subtask / answer to the VLA prompt in web data
- $a_{t:t+H}$ — predicted action chunk
We can decompose this as

$$\pi_\theta(a_{t:t+H}, \hat{\ell}_t \mid o_t, \ell) = \pi_\theta(a_{t:t+H} \mid o_t, \hat{\ell}_t)\,\pi_\theta(\hat{\ell}_t \mid o_t, \ell)$$

The action distribution does NOT depend on $\ell$ (the overall prompt), only on the high-level subtask $\hat{\ell}_t$
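This decomposition implies a two-step inference procedure: first sample a subtask from the high-level distribution, then sample an action chunk conditioned on that subtask alone. A toy sketch of that control flow (the `high_level`/`low_level` tables and all names are invented for illustration, not the paper's API):

```python
import random

# Hypothetical high-level policy: subtask distribution given (observation, overall prompt).
high_level = {
    ("kitchen_scene", "clean the kitchen"): {"pick up the cup": 0.6, "wipe the counter": 0.4},
}

# Hypothetical low-level policy: action chunk given (observation, subtask) ONLY --
# mirroring that the action distribution does not condition on the overall prompt.
low_level = {
    ("kitchen_scene", "pick up the cup"): ["reach", "grasp", "lift"],
    ("kitchen_scene", "wipe the counter"): ["reach", "wipe", "retract"],
}

def infer(obs, prompt):
    """Sample a subtask from the high-level policy, then look up the action chunk."""
    subtasks = high_level[(obs, prompt)]
    l_hat = random.choices(list(subtasks), weights=list(subtasks.values()))[0]
    chunk = low_level[(obs, l_hat)]  # conditions on l_hat, never on prompt
    return l_hat, chunk

subtask, actions = infer("kitchen_scene", "clean the kitchen")
```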
The model takes in $N$ multimodal input tokens $x_{1:N}$, where each token may be a discretized or continuous input. It produces a sequence of multimodal outputs $y_{1:N}$, which we write as

$$y_{1:N} = f(x_{1:N}, m_{1:N}, A)$$

- $m_{1:N}$ — token type of each input
- $A$ — attention mask, which tells the model which tokens are allowed to attend to which ones
- Attention Mask (Detail)
    - Prefix block (P): P tokens attend bidirectionally to each other (images/prompt/state can interact freely)
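A small sketch of such a block attention mask. The layout here is an assumption for illustration (prefix tokens first, then a suffix of action tokens attending causally to everything before them); the paper's exact blockwise mask may differ:

```python
def build_mask(num_prefix, num_suffix):
    """Return mask[i][j] = True iff token i may attend to token j.

    Prefix tokens (images/prompt/state) attend bidirectionally among
    themselves; suffix tokens attend to the full prefix and causally
    to earlier suffix tokens (including themselves).
    """
    n = num_prefix + num_suffix
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < num_prefix and j < num_prefix:
                mask[i][j] = True   # bidirectional within the prefix block
            elif i >= num_prefix and j <= i:
                mask[i][j] = True   # suffix: full prefix access + causal
    return mask
```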
The model outputs both text and actions: the outputs $y_{1:N}$ are split into text token logits and action output tokens
The first $M$ outputs are text token logits that can be used to sample $\hat{\ell}_t$; the later $H$ tokens are produced by a separate action expert
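The split can be sketched as follows (toy shapes; `split_outputs` is an invented helper, not a name from the paper):

```python
def split_outputs(outputs, M, H):
    """Split a length-(M + H) output sequence into text logits and action tokens.

    The first M positions carry text-token logits (used to sample the
    high-level subtask); the last H positions are the action expert's
    outputs, which get decoded into a continuous action chunk.
    """
    assert len(outputs) == M + H, "output sequence must have exactly M + H tokens"
    text_logits = outputs[:M]
    action_tokens = outputs[-H:]
    return text_logits, action_tokens
```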