Paper Link: https://arxiv.org/pdf/2506.01844

This paper was a really cool read, maybe because I think I finally understand why things are the way they are and everything is making sense.

Cross Attention vs Self Attention

I think the intuition is:

  • cross-attention tells the action expert what is happening in the scene from the VLM features
  • self-attention lets the actions inside the chunk stay consistent with each other
  • the causal mask makes sure that consistency only flows forward in time, not backward from the future

That is probably also why the paper says causal self-attention gives smoother action chunks. The model can coordinate later actions based on earlier ones, while still respecting the actual order in which the robot would produce them.
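The causal mask above can be sketched in a few lines. This is a minimal single-head NumPy sketch of causal self-attention over an action chunk; the shapes and the single-head setup are my assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(q, k, v):
    # q, k, v: (T, d) — one row per action in the chunk.
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T) pairwise similarities
    future = np.triu(np.ones((T, T), dtype=bool), k=1) # True above the diagonal
    scores[future] = -np.inf                           # block attention to future actions
    # Row i of the weights is nonzero only for positions <= i,
    # so consistency flows forward in time, never backward.
    return softmax(scores, axis=-1) @ v
```

Because row 0 can only attend to position 0, the first action in the chunk is computed from itself alone, while later actions can condition on everything before them.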

Why project the state into one token

Projecting the sensorimotor state with a linear layer makes it match the LM hidden size, which means it can be treated like just another token in the transformer sequence. They only use a single token because the state is already compact and structured, unlike images which need many tokens.

Passing visual + language + state tokens together through the decoder means the features sent to the action expert are already grounded in:

  • what the robot sees
  • what the instruction is asking for
  • what the robot’s current physical state is
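The state-token idea is just a linear map into the LM width followed by concatenation. A minimal sketch, where the state dimension (9), hidden size (64), and token counts are all made-up placeholders rather than SmolVLA's real numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64                                   # hypothetical LM hidden size

# Sensorimotor state, e.g. joint angles + gripper; 9 dims is an assumption.
state = rng.normal(size=(9,))
W = rng.normal(size=(9, d_model)) * 0.02       # learned linear projection
state_token = state @ W                        # (d_model,) — one token, same width as the LM

vision_tokens = rng.normal(size=(196, d_model))  # hypothetical image patch tokens
text_tokens = rng.normal(size=(12, d_model))     # tokenized instruction

# The decoder sees a single fused sequence: vision + language + state.
sequence = np.concatenate([vision_tokens, text_tokens, state_token[None, :]], axis=0)
```

One token is enough because the state vector is already a compact summary, so a single projected embedding carries all of it, whereas an image needs many patch tokens to preserve spatial detail.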

Action Expert Architecture

SmolVLA uses SmolVLM2 as the multimodal backbone, but the action expert is a separate transformer module on top of it.

So both are transformer-based, but they are not the same thing:

  • SmolVLM2 = the vision-language backbone that builds fused multimodal features
  • action expert = a smaller control transformer that conditions on those VLM features and predicts action chunks

I think the easiest way to think about it is that SmolVLM2 does the understanding, and the action expert does the motor control.
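The division of labor can be sketched as two stand-in functions: a backbone that emits fused conditioning features, and an expert that reads them out into a chunk of actions. Everything here is hypothetical (the dimensions, the single cross-attention readout, the learned-query pooling); it only illustrates the data flow, not the paper's actual flow-matching expert:

```python
import numpy as np

d_vlm, chunk, act_dim = 64, 10, 7        # illustrative sizes, not the paper's
rng = np.random.default_rng(1)

def backbone(n_tokens=209):
    # Stand-in for SmolVLM2: returns fused vision + language + state features.
    return rng.normal(size=(n_tokens, d_vlm))

def action_expert(features):
    # Stand-in for the control transformer: learned action queries
    # cross-attend to the backbone features, then a linear head maps
    # each pooled query to a low-level action.
    queries = rng.normal(size=(chunk, d_vlm))
    w = features @ queries.T                   # (n_tokens, chunk) attention logits
    w = np.exp(w - w.max(axis=0))
    w = w / w.sum(axis=0)                      # softmax over the feature tokens
    pooled = w.T @ features                    # (chunk, d_vlm)
    head = rng.normal(size=(d_vlm, act_dim)) * 0.02
    return pooled @ head                       # (chunk, act_dim) action chunk

actions = action_expert(backbone())
```

The point of the split is that the backbone's weights do perception and language grounding once, and the smaller expert only has to learn the mapping from those features to motor commands.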