Vision tokens are the input units of Vision Transformers (ViTs)
Vision tokens are small patches of an image that the model treats like words in a sentence
E.g. a 224 x 224 pixel image split into patches of 16 x 16 pixels gives a 14 x 14 grid = 196 patches (similar to a sentence of 196 words)
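
The patch arithmetic can be checked with a short NumPy sketch. It assumes non-overlapping square patches and an RGB image stored as (height, width, channels). The function name patchify and the all-zeros placeholder image are illustrative choices, not taken from any particular library.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Expose the patch grid: (rows, patch, cols, patch, C)
    grid = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Bring the two grid axes together: (rows, cols, patch, patch, C)
    grid = grid.transpose(0, 2, 1, 3, 4)
    # One row per patch, each flattened to patch * patch * C values
    return grid.reshape(rows * cols, patch_size * patch_size * channels)


image = np.zeros((224, 224, 3), dtype=np.float32)  # placeholder RGB image
patches = patchify(image, 16)
print(patches.shape)  # (196, 768): 14 * 14 patches, 16 * 16 * 3 values each
```

In a Vision Transformer, each flattened patch is then typically passed through a learned linear projection to become a token embedding. That step is left out here.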