Paper Link: https://www.pi.website/download/pi05.pdf

Vision Language Action (VLA) model

VLA total params = Vision Language Model (VLM) params + action head params

Vision Language Action (VLA) models are initialized with the weights of the pre-trained VLM
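The two points above (parameter counts, VLM-initialized weights) can be sketched as follows. This is a minimal toy illustration, not the actual π0.5 implementation: the weight names, sizes, and the dict-of-arrays representation are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny "pre-trained VLM": a dict of named weight arrays.
vlm_weights = {
    "vision_encoder.w": rng.standard_normal((8, 8)),
    "language_model.w": rng.standard_normal((8, 8)),
}

def init_vla(pretrained_vlm, hidden=8, action_dim=4):
    """Build VLA params: copy the pre-trained VLM weights, add a fresh action head."""
    vla = {k: v.copy() for k, v in pretrained_vlm.items()}   # initialized from the VLM
    vla["action_head.w"] = rng.standard_normal((hidden, action_dim))  # new params
    return vla

def n_params(weights):
    return sum(w.size for w in weights.values())

vla_weights = init_vla(vlm_weights)

# VLA total params = VLM params + action head params
assert n_params(vla_weights) == n_params(vlm_weights) + vla_weights["action_head.w"].size
```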

Training

The pi 0.5 paper goes into great detail about the training mixtures used to train the model, which help improve its generalizability; I won't cover those here

VLAs are trained via imitation learning on a dataset D of robot demonstrations, by maximizing the log-likelihood of an action given an observation and a natural language task

The observation typically contains one or more images
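The imitation-learning objective above can be sketched as a negative log-likelihood loss over demonstrated actions. A minimal sketch assuming discretized action bins; the bin count, logits, and `imitation_nll` helper are illustrative, not from the paper.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def imitation_nll(logits, expert_actions):
    """Negative log-likelihood of the expert's (discretized) actions.

    logits: (batch, n_action_bins) model outputs for (observation, task) pairs
    expert_actions: (batch,) index of the demonstrated action bin
    Maximizing log-likelihood == minimizing this loss.
    """
    logp = log_softmax(logits)
    return -logp[np.arange(len(expert_actions)), expert_actions].mean()

# Toy batch: 2 (observation, task) pairs, 4 possible action bins.
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 3.0, 0.1]])
expert_actions = np.array([0, 2])
loss = imitation_nll(logits, expert_actions)
```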

pi 0.5 is trained in 2 stages:

  1. pre-training stage to adapt the model to diverse robotic tasks
  2. post-training stage to specialize it to mobile manipulation

post-training is when they add the action expert

Architecture

The architecture can represent both action chunk distributions and tokenized text outputs

For subtask inference the joint distribution is written as p(a_{t:t+H}, ℓ̂_t | o_t, ℓ_t), where

  • ℓ_t - overall task prompt
  • ℓ̂_t - high-level subtask / answer to the VLA prompt in web data
  • a_{t:t+H} - predicted action chunk

We can decompose this as

p(a_{t:t+H}, ℓ̂_t | o_t, ℓ_t) = p(a_{t:t+H} | o_t, ℓ̂_t) p(ℓ̂_t | o_t, ℓ_t)

The action distribution does NOT depend on ℓ_t (the overall prompt) but does depend on the high-level subtask ℓ̂_t
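The two-stage inference implied by this decomposition (sample a subtask, then sample actions conditioned on it) can be sketched as below. The subtask strings, probabilities, and both sampling functions are stand-ins for the real VLM decoding and action expert.

```python
import numpy as np

rng = np.random.default_rng(0)

SUBTASKS = ["pick up the sponge", "wipe the counter"]

def sample_subtask(obs, prompt):
    """Stage 1: the high-level subtask depends on observation AND overall prompt."""
    probs = np.array([0.7, 0.3])  # stand-in for VLM text decoding
    return SUBTASKS[rng.choice(len(SUBTASKS), p=probs)]

def sample_action_chunk(obs, subtask, horizon=5, action_dim=3):
    """Stage 2: the action chunk depends on observation and subtask only,
    NOT on the overall prompt."""
    return rng.standard_normal((horizon, action_dim))  # stand-in for action expert

obs, prompt = "camera frame", "clean the kitchen"
subtask = sample_subtask(obs, prompt)
actions = sample_action_chunk(obs, subtask)
```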

The model takes in N multimodal input tokens x_{1:N}; "token" here covers both discretized (text) and continuous (image, state) inputs

We produce a sequence of multimodal outputs y_{1:M+H} = f(x_{1:N}, z_{1:N}, A), which we write as

  • z_{1:N} - token type of each input
  • A - attention mask, tells the model which tokens are allowed to attend to which ones
    • Attention mask (detail)
      • Prefix block (P): P tokens attend bidirectionally to each other (images/prompt/state can interact freely)
The model outputs both text and actions: the output sequence is split into text token logits and action output tokens

The first M outputs are text token logits that can be used to sample the high-level subtask; the later H tokens are produced by a separate action expert
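The split of the output sequence can be sketched as a simple slice; the shapes and the `split_outputs` helper are illustrative, not from the paper.

```python
import numpy as np

def split_outputs(y, n_text, n_action):
    """Split the model's output sequence: the first M positions are text-token
    logits (used to sample the subtask), the last H positions are action
    tokens from the action expert."""
    assert y.shape[0] == n_text + n_action
    text_logits = y[:n_text]
    action_tokens = y[n_text:]
    return text_logits, action_tokens

M, H, d = 4, 3, 6  # M text positions, H action positions, hidden dim d
y = np.arange(M + H).repeat(d).reshape(M + H, d).astype(float)
text_logits, action_tokens = split_outputs(y, M, H)
```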

Discrete and Continuous Action Representations