Paper Link: https://arxiv.org/abs/2509.18610v1

Semantic 3D Gaussian Splatting

They take a 3D Gaussian Splatting (3DGS) scene reconstruction and add a semantic layer to it, so the system can answer queries like “where is the chair” inside that reconstructed scene. CLIP embeddings are distilled into the 3DGS representation.
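A minimal sketch of how such a query could be answered once per-Gaussian CLIP features exist. This is a hypothetical illustration, not the paper’s code: `semantic_query`, the array layout, and the threshold are all assumptions; the real pipeline distills features during training and uses CLIP’s text encoder to produce `text_embedding`.

```python
import numpy as np

def semantic_query(gaussian_embeddings, text_embedding, threshold=0.5):
    """Score each Gaussian's distilled CLIP feature against a text query.

    gaussian_embeddings: (N, D) per-Gaussian CLIP features (assumed layout)
    text_embedding:      (D,) CLIP text feature for e.g. "where is the chair"
    Returns indices of Gaussians whose cosine similarity exceeds threshold,
    plus the full similarity vector.
    """
    g = gaussian_embeddings / np.linalg.norm(gaussian_embeddings,
                                             axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    sims = g @ t                      # cosine similarity per Gaussian
    return np.where(sims > threshold)[0], sims
```

The centroid of the selected Gaussians’ positions would then give the target object’s 3D centroid used in the next section.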

Spatially Spanning Trajectories

Pick a semantic object in the scene and compute its 3D centroid. That object becomes the root of an RRT* tree, which grows branches outward through free space until they span much of the room in the horizontal plane at the object’s height.

This builds a big tree from the object outward to many possible drone starting regions

This is called spatially spanning: one tree around one object gives many valid approach paths from many places in the environment
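The idea can be sketched with a simplified planner. This is an assumption-laden toy, not the paper’s method: it is plain RRT (no RRT* rewiring or cost optimization), in 2D at the object’s height, and the function name, step size, and iteration count are made up. The “bubble” rejection around sparse point-cloud obstacles mirrors the collision buffers described in the pipeline below.

```python
import numpy as np

def grow_spanning_tree(root, bounds, obstacles, bubble_radius=0.3,
                       step=0.5, n_iters=500, seed=0):
    """Grow a simplified RRT outward from an object centroid (root).

    bounds:    ((xmin, xmax), (ymin, ymax)) in the horizontal plane
    obstacles: (M, 2) sparse point-cloud points; candidate nodes closer
               than bubble_radius to any of them are rejected
    Returns nodes as a (K, 2) array and a parent-index list (root = -1).
    """
    rng = np.random.default_rng(seed)
    nodes = [np.asarray(root, dtype=float)]
    parents = [-1]
    (xmin, xmax), (ymin, ymax) = bounds
    for _ in range(n_iters):
        # sample a random point, steer from the nearest node toward it
        sample = np.array([rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)])
        dists = [np.linalg.norm(sample - n) for n in nodes]
        nearest = int(np.argmin(dists))
        direction = sample - nodes[nearest]
        norm = np.linalg.norm(direction)
        if norm < 1e-9:
            continue
        new = nodes[nearest] + step * direction / norm
        # "bubble" collision buffer around sparse point-cloud obstacles
        if len(obstacles) and np.min(
                np.linalg.norm(obstacles - new, axis=1)) < bubble_radius:
            continue
        nodes.append(new)
        parents.append(nearest)
    return np.array(nodes), parents
```

Because growth starts at the object and samples cover the whole room, the leaves end up scattered across many possible drone starting regions, which is exactly the “spatially spanning” property.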

Pipeline:

  1. Find target object in 3D
  2. Grow RRT* tree through empty space
  3. Avoid obstacles using “bubbles”
    1. Place collision buffers around points in the sparse point cloud and forbid the tree from entering those regions
  4. Bias the final approach direction
  5. Clean up the tree
  6. Add motion and viewing behavior
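After the tree is grown and cleaned up, each leaf defines one candidate trajectory: walk the parent pointers from that leaf back to the root at the object. A minimal sketch, assuming the tree is stored as a parent-index array (a hypothetical representation; the helper name is made up):

```python
def leaf_to_root_path(parents, leaf):
    """Walk parent pointers from a leaf node back to the tree root.

    parents[i] is the parent index of node i; the root has parent -1.
    Returns node indices ordered leaf -> root (reverse for root -> leaf).
    """
    path = [leaf]
    while parents[path[-1]] != -1:
        path.append(parents[path[-1]])
    return path
```

Reversing the returned list gives the object-to-leaf direction; the data-synthesis section below flies these paths leaf-to-root (i.e. toward the object).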

This is a rough plan and should not be used until it has been passed through an MPC expert
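To give a flavor of that refinement step: an MPC expert repeatedly solves a short-horizon tracking problem over the drone’s dynamics. The sketch below is a stand-in, not the paper’s controller — it uses an unconstrained finite-horizon LQR (Riccati recursion) on a toy 1-D double integrator, whereas the real expert handles full quadrotor dynamics and constraints; all weights and the model are assumptions.

```python
import numpy as np

def tracking_gains(A, B, Q, R, horizon):
    """Finite-horizon Riccati recursion for u = -K x tracking gains.

    Returns one gain matrix per step, ordered for forward simulation.
    """
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

# toy 1-D double integrator: state = [position error, velocity error]
dt = 0.05  # 20 Hz step, matching the logging rate in the notes
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Ks = tracking_gains(A, B, np.eye(2), np.array([[0.1]]), horizon=40)

x = np.array([1.0, 0.0])        # start 1 m off the reference path
for K in Ks:
    x = A @ x - B @ (K @ x)     # apply u = -K x at each step
```

The point of the sketch: the raw RRT* polyline is only a geometric reference; the controller is what turns it into dynamically feasible states and inputs, which are then logged as training data.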

Policy Expert Data Synthesis

  • MPC expert flies the RRT* trajectories from leaf-to-root (or inverted)
  • At each time step k, they render an RGB image from the 3D Gaussian Splatting scene
    • Training is not done on the raw RGB; the image is passed through CLIPSeg to get a semantic representation
  • Also saved
    • state: drone pose, velocity, etc.
    • control inputs: thrust and attitude commands
  • Everything is saved at 20 Hz, so 20 samples / second
  • Simulation is intentionally perturbed: they randomize drone mass and thrust coefficient, and apply pose and velocity perturbations every 2 seconds, each by up to 30% relative to the real drone parameters
  • They chop long simulated flight trajectories into 2-second chunks
    • prevents overfitting to one long specific trajectory
    • frames learning as “react correctly from the current situation”
  • Instead of feeding raw camera images to the policy, they use CLIPSeg to create a semantic relevance map for the user’s query
    • They color this map with a 3-channel colormap
      • bright red/orange = highly relevant to the query
      • yellow / green = moderately relevant
      • blue = irrelevant
    • This helps because RGB varies a lot across environments, while semantic maps are much more stable
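A possible version of that 3-channel coloring. The notes only specify blue = irrelevant, yellow/green = moderate, red/orange = high, so the exact colormap here (a linear blue → green → red blend) and the function name are assumptions, not the paper’s implementation:

```python
import numpy as np

def relevance_to_rgb(rel):
    """Map a CLIPSeg relevance map in [0, 1] to a 3-channel RGB image.

    Hypothetical colormap: blue at 0 (irrelevant), green at 0.5
    (moderately relevant), red at 1 (highly relevant).
    """
    rel = np.clip(rel, 0.0, 1.0)
    r = np.clip(2.0 * rel - 1.0, 0.0, 1.0)   # ramps up in the top half
    g = 1.0 - np.abs(2.0 * rel - 1.0)        # peaks at mid relevance
    b = np.clip(1.0 - 2.0 * rel, 0.0, 1.0)   # ramps down in the bottom half
    return np.stack([r, g, b], axis=-1)
```

Applied to an H×W relevance map, this yields an H×W×3 image the policy can consume in place of raw RGB.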
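The 2-second chunking mentioned above is also simple to sketch. At 20 Hz, a 2-second window is 40 samples; dropping a trailing partial chunk is an assumption here (the paper may pad or overlap windows instead), and the function name is made up:

```python
def chunk_trajectory(samples, chunk_len=40):
    """Split one logged trajectory into fixed-length training chunks.

    At 20 Hz, chunk_len=40 gives the 2-second windows from the notes.
    A trailing partial chunk is dropped (assumption).
    """
    return [samples[i:i + chunk_len]
            for i in range(0, len(samples) - chunk_len + 1, chunk_len)]
```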