Paper Link: https://research.nvidia.com/labs/lpr/locate-anything/LocateAnything.pdf

LocateAnything is a generative vision-language model (VLM) for localization tasks such as detection and grounding.

This paper’s main contributions are:

  • An early exploration of applying multi-token prediction to VLM-based detection/grounding, via Parallel Box Decoding
  • A hybrid decoding policy that detects unreliable parallel blocks and re-decodes only the problematic block with next-token prediction (NTP)
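
To make the hybrid decoding idea concrete, here is a minimal toy sketch of the control flow. It is not the paper's implementation. The reliability test (a minimum per-token confidence against a threshold), the block layout, and the names `hybrid_decode`, `ntp_redecode`, and `toy_ntp` are all assumptions made for illustration; the notes above do not say how unreliable blocks are actually detected.

```python
from typing import Callable, List, Sequence, Tuple

# A parallel block: (tokens, per-token confidences). Hypothetical layout.
Block = Tuple[List[int], List[float]]


def hybrid_decode(
    parallel_blocks: Sequence[Block],
    ntp_redecode: Callable[[List[List[int]], int], List[int]],
    threshold: float = 0.5,
) -> Tuple[List[List[int]], List[int]]:
    """Keep reliable parallel blocks; re-decode only the flagged ones."""
    output: List[List[int]] = []
    redone: List[int] = []
    for i, (tokens, confs) in enumerate(parallel_blocks):
        # Assumed reliability test: every token must clear the threshold.
        if min(confs) >= threshold:
            output.append(list(tokens))
        else:
            # Localized fallback: only block i is regenerated sequentially,
            # conditioned on the blocks accepted so far.
            output.append(ntp_redecode(output, i))
            redone.append(i)
    return output, redone


def toy_ntp(prefix: List[List[int]], index: int) -> List[int]:
    # Stand-in for a token-by-token decoder; returns a fixed dummy block.
    return [100 + index] * 4


blocks = [
    ([10, 20, 30, 40], [0.90, 0.95, 0.90, 0.92]),
    ([11, 21, 31, 41], [0.90, 0.20, 0.90, 0.90]),  # one low-confidence token
    ([12, 22, 32, 42], [0.80, 0.80, 0.80, 0.80]),
]
boxes, redone = hybrid_decode(blocks, toy_ntp)
```

In this toy run only the second block falls below the threshold, so it is the only one passed to the sequential fallback; the other two blocks keep their parallel output unchanged.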

Architecture