Vision tokens are used in Vision Transformers

Vision tokens are small patches of an image that the model treats like words in a sentence

E.g. a 224 x 224 pixel image split into patches of 16 x 16 pixels gives a 14 x 14 grid (224 / 16 = 14 per side), i.e. 196 patches (similar to a sentence of 196 words)
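
A minimal sketch of the patch arithmetic, using NumPy. The function name `patchify`, the all-zero dummy image, and the 3 colour channels are assumptions made for illustration, not something taken from a particular library. It only covers the splitting step; in an actual Vision Transformer each flattened patch is then passed through a learned linear projection to become a token embedding.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Separate each spatial axis into (grid index, offset inside the patch)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Bring the two grid indices to the front: (rows, cols, patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, each flattened to patch_size * patch_size * C values
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

# Hypothetical input: a blank 224 x 224 RGB image
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image, 16)
print(tokens.shape)  # (196, 768)
```

Each of the 196 rows is one patch: 16 x 16 pixels x 3 channels = 768 raw values.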