Paper Link: https://www.pi.website/download/pi05.pdf

Vision Language Action (VLA) model

VLA total params = Vision Language Model (VLM) params + action head params

Vision Language Action (VLA) models are initialized with the weights of the pre-trained VLM
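The two points above (parameter counts, VLM-initialized weights) can be sketched as follows. This is a minimal toy illustration, not the actual π0.5 implementation: the weight names, sizes, and the dict-of-arrays representation are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny "pre-trained VLM": a dict of named weight arrays.
vlm_weights = {
    "vision_encoder.w": rng.standard_normal((8, 8)),
    "language_model.w": rng.standard_normal((8, 8)),
}

def init_vla(pretrained_vlm, hidden=8, action_dim=4):
    """Build VLA params: copy the pre-trained VLM weights, add a fresh action head."""
    vla = {k: v.copy() for k, v in pretrained_vlm.items()}   # initialized from the VLM
    vla["action_head.w"] = rng.standard_normal((hidden, action_dim))  # new params
    return vla

def n_params(weights):
    return sum(w.size for w in weights.values())

vla_weights = init_vla(vlm_weights)

# VLA total params = VLM params + action head params
assert n_params(vla_weights) == n_params(vlm_weights) + vla_weights["action_head.w"].size
```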

Training

The pi 0.5 paper goes into great detail about the training mixtures used to train the model, which help improve its generalizability; I won't cover those here

VLAs are trained via imitation learning on a dataset D of robot demonstrations, by maximizing the log-likelihood of an action given an observation and a natural language task

The observation typically contains one or more images
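The imitation-learning objective above can be sketched as a negative log-likelihood loss over demonstrated actions. A minimal sketch assuming discretized action bins; the bin count, logits, and `imitation_nll` helper are illustrative, not from the paper.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def imitation_nll(logits, expert_actions):
    """Negative log-likelihood of the expert's (discretized) actions.

    logits: (batch, n_action_bins) model outputs for (observation, task) pairs
    expert_actions: (batch,) index of the demonstrated action bin
    Maximizing log-likelihood == minimizing this loss.
    """
    logp = log_softmax(logits)
    return -logp[np.arange(len(expert_actions)), expert_actions].mean()

# Toy batch: 2 (observation, task) pairs, 4 possible action bins.
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 3.0, 0.1]])
expert_actions = np.array([0, 2])
loss = imitation_nll(logits, expert_actions)
```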

pi 0.5 is trained in 2 stages:

  1. pre-training stage to adapt the model to diverse robotic tasks
  2. post-training stage to specialize it to mobile manipulation

post-training is when they add the action expert

Architecture

The architecture can represent both action chunk distributions and tokenized text outputs

For subtask inference the joint distribution is written as p(a_{t:t+H}, ℓ̂_t | o_t, ℓ_t), where

  • ℓ_t - overall task prompt
  • ℓ̂_t - high-level subtask / answer to the VLA prompt in web data
  • a_{t:t+H} - predicted action chunk

We can decompose this as

p(a_{t:t+H}, ℓ̂_t | o_t, ℓ_t) = p(a_{t:t+H} | o_t, ℓ̂_t) p(ℓ̂_t | o_t, ℓ_t)

The action distribution does NOT depend on ℓ_t (the overall prompt) but does depend on the high-level subtask ℓ̂_t
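The two-stage inference implied by this decomposition (sample a subtask, then sample actions conditioned on it) can be sketched as below. The subtask strings, probabilities, and both sampling functions are stand-ins for the real VLM decoding and action expert.

```python
import numpy as np

rng = np.random.default_rng(0)

SUBTASKS = ["pick up the sponge", "wipe the counter"]

def sample_subtask(obs, prompt):
    """Stage 1: the high-level subtask depends on observation AND overall prompt."""
    probs = np.array([0.7, 0.3])  # stand-in for VLM text decoding
    return SUBTASKS[rng.choice(len(SUBTASKS), p=probs)]

def sample_action_chunk(obs, subtask, horizon=5, action_dim=3):
    """Stage 2: the action chunk depends on observation and subtask only,
    NOT on the overall prompt."""
    return rng.standard_normal((horizon, action_dim))  # stand-in for action expert

obs, prompt = "camera frame", "clean the kitchen"
subtask = sample_subtask(obs, prompt)
actions = sample_action_chunk(obs, subtask)
```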

The model takes in N multimodal input tokens x_{1:N}; "token" here covers both discretized (text) and continuous (image, state) inputs

We produce a sequence of multimodal outputs y_{1:M+H} = f(x_{1:N}, z_{1:N}, A), which we write as

  • z_{1:N} - token type of each input
  • A - attention mask, tells the model which tokens are allowed to attend to which ones
    • Attention mask (detail)
      • Prefix block (P): P tokens attend bidirectionally to each other (images/prompt/state can interact freely)
The model outputs both text and actions: the output sequence is split into text token logits and action output tokens

The first M outputs are text token logits that can be used to sample the high-level subtask; the later H tokens are produced by a separate action expert
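The split of the output sequence can be sketched as a simple slice; the shapes and the `split_outputs` helper are illustrative, not from the paper.

```python
import numpy as np

def split_outputs(y, n_text, n_action):
    """Split the model's output sequence: the first M positions are text-token
    logits (used to sample the subtask), the last H positions are action
    tokens from the action expert."""
    assert y.shape[0] == n_text + n_action
    text_logits = y[:n_text]
    action_tokens = y[n_text:]
    return text_logits, action_tokens

M, H, d = 4, 3, 6  # M text positions, H action positions, hidden dim d
y = np.arange(M + H).repeat(d).reshape(M + H, d).astype(float)
text_logits, action_tokens = split_outputs(y, M, H)
```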

Discrete and Continuous Action Representations