Paper Link: https://arxiv.org/abs/2509.18610v1
Semantic 3D Gaussian Splatting
They take a 3D Gaussian Splatting scene reconstruction and add a semantic layer to it so the system can answer queries like “where is the chair” inside the reconstructed scene. CLIP embeddings are distilled into the 3DGS representation.
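A minimal sketch of what querying distilled embeddings could look like, assuming each Gaussian carries a CLIP feature vector (the array names and the threshold are illustrative, not from the paper):

```python
import numpy as np

def query_gaussians(gaussian_embeds, text_embed, threshold=0.5):
    """Score each Gaussian against a text query by cosine similarity.

    gaussian_embeds: (N, D) distilled CLIP features, one per Gaussian (assumed layout).
    text_embed: (D,) CLIP text embedding for e.g. "where is the chair".
    Returns a boolean mask over Gaussians relevant to the query.
    """
    g = gaussian_embeds / np.linalg.norm(gaussian_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    sims = g @ t  # cosine similarity per Gaussian
    return sims > threshold
```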
Spatially Spanning Trajectories
Pick a semantic object in the scene and compute its 3D centroid
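The centroid step could be as simple as a relevance-weighted mean of the matching Gaussian centers (a sketch; the variable names are assumptions):

```python
import numpy as np

def semantic_centroid(centers, relevance):
    """Relevance-weighted 3D centroid of the Gaussians matching a query.

    centers: (N, 3) Gaussian means; relevance: (N,) per-Gaussian query scores.
    """
    w = relevance / relevance.sum()
    return w @ centers
```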
The planner then grows a big tree from the object outward to many possible drone starting regions
This is called spatially spanning: one tree around one object gives many valid approach paths from many places in the environment
Pipeline:
- Find target object in 3D
- Grow RRT* tree through empty space
- Avoid obstacles using “bubbles”
- Place collision buffers around points in sparse point cloud and forbid tree from entering those regions
- Bias the final approach direction
- Clean up the tree
- Add motion and viewing behavior
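The obstacle-avoidance step above can be sketched as a point-cloud “bubble” check used to reject RRT* edges (a minimal sketch; the buffer radius and step size are assumed values, not from the paper):

```python
import numpy as np

def in_bubble(p, cloud_points, buffer_radius=0.3):
    """True if point p lies inside any collision 'bubble' placed around a
    sparse point-cloud point (buffer_radius is an assumed safety margin)."""
    d = np.linalg.norm(cloud_points - p, axis=1)
    return bool((d < buffer_radius).any())

def edge_is_free(a, b, cloud_points, buffer_radius=0.3, step=0.05):
    """Check a candidate tree edge by sampling points along the segment a->b
    and rejecting the edge if any sample enters a bubble."""
    n = max(2, int(np.linalg.norm(b - a) / step))
    for t in np.linspace(0.0, 1.0, n):
        if in_bubble(a + t * (b - a), cloud_points, buffer_radius):
            return False
    return True
```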
This is a rough plan and should not be used until it has been passed through an MPC expert
Policy Expert Data Synthesis
- MPC expert flies the RRT* trajectories from leaf-to-root (or inverted)
- At each time step k they render an RGB image from the 3D Gaussian Splatting scene
- The training is not done on the raw RGB; the image is passed through CLIPSeg to get a semantic representation
- Also saved:
    - state: where the drone is, velocity, etc.
    - input: control actions, thrust and attitude commands
- Everything is saved at 20 Hz, i.e. 20 samples per second
- Simulation is intentionally perturbed: they randomize drone mass and thrust coefficient by up to 30% relative to the real drone parameters, and apply pose and velocity perturbations every 2 seconds
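The parameter randomization above can be sketched as follows (the function name and nominal values are illustrative; the paper only states the ±30% range):

```python
import random

def randomize_params(nominal_mass, nominal_thrust_coeff, max_rel=0.30):
    """Perturb nominal drone parameters by up to +/-30% relative to the real
    values, as the notes describe; pose/velocity perturbations every 2 s
    would be applied separately inside the simulator loop."""
    scale = lambda: 1.0 + random.uniform(-max_rel, max_rel)
    return nominal_mass * scale(), nominal_thrust_coeff * scale()
```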
- They chop long simulated flight trajectories into 2-second chunks
    - prevents overfitting to one specific long trajectory
    - frames the learning problem as “react correctly from the current situation”
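At 20 Hz, a 2-second chunk is 40 samples, so the chunking step above amounts to (a sketch; dropping the short tail is my assumption, not stated in the paper):

```python
def chunk_trajectory(samples, rate_hz=20, chunk_seconds=2.0):
    """Split a long rollout (a list of per-step records logged at 20 Hz)
    into fixed 2-second chunks of 40 samples each; any short tail is
    dropped so every chunk has equal length (assumed handling)."""
    n = int(rate_hz * chunk_seconds)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]
```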
- Instead of feeding raw camera images to the policy, they use CLIPSeg to create a semantic relevance map for the user’s query
- They color this map with a 3-channel colormap
- bright red/orange = highly relevant to the query
- yellow / green = moderately relevant
- blue = irrelevant
- This helps because RGB changes a lot across environments but semantic maps are more stable across environments
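The blue → green → red coloring described above can be sketched with a simple jet-like ramp (the exact colormap the paper uses is not specified here; this is a stand-in):

```python
import numpy as np

def relevance_to_rgb(relevance):
    """Map a [0, 1] relevance map to a 3-channel image:
    blue = irrelevant, green/yellow = moderately relevant, red = highly relevant."""
    r = np.clip(2.0 * relevance - 1.0, 0.0, 1.0)   # red rises in the upper half
    g = 1.0 - np.abs(2.0 * relevance - 1.0)        # green peaks mid-range
    b = np.clip(1.0 - 2.0 * relevance, 0.0, 1.0)   # blue falls in the lower half
    return np.stack([r, g, b], axis=-1)
```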