Paper Link: https://arxiv.org/abs/2509.18610v1

Semantic 3D Gaussian Splatting

They take a 3D Gaussian Splatting (3DGS) scene reconstruction and add a semantic layer to it, so the system can answer queries like “where is the chair” inside that reconstructed scene. CLIP embeddings are distilled into the 3DGS representation.
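A minimal sketch of how such a query could be answered once per-Gaussian CLIP features exist. This is a hypothetical illustration, not the paper’s code: `semantic_query`, the array layout, and the threshold are all assumptions; the real pipeline distills features during training and uses CLIP’s text encoder to produce `text_embedding`.

```python
import numpy as np

def semantic_query(gaussian_embeddings, text_embedding, threshold=0.5):
    """Score each Gaussian's distilled CLIP feature against a text query.

    gaussian_embeddings: (N, D) per-Gaussian CLIP features (assumed layout)
    text_embedding:      (D,) CLIP text feature for e.g. "where is the chair"
    Returns indices of Gaussians whose cosine similarity exceeds threshold,
    plus the full similarity vector.
    """
    g = gaussian_embeddings / np.linalg.norm(gaussian_embeddings,
                                             axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    sims = g @ t                      # cosine similarity per Gaussian
    return np.where(sims > threshold)[0], sims
```

The centroid of the selected Gaussians’ positions would then give the target object’s 3D centroid used in the next section.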

Spatially Spanning Trajectories

Pick a semantic object in the scene and compute its 3D centroid. That object becomes the root of an RRT* tree, which grows branches outward through free space until they span much of the room in the horizontal plane at the object’s height.

This builds a big tree from the object outward to many possible drone starting regions

This is called spatially spanning: one tree around one object gives many valid approach paths from many places in the environment
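The idea can be sketched with a simplified planner. This is an assumption-laden toy, not the paper’s method: it is plain RRT (no RRT* rewiring or cost optimization), in 2D at the object’s height, and the function name, step size, and iteration count are made up. The “bubble” rejection around sparse point-cloud obstacles mirrors the collision buffers described in the pipeline below.

```python
import numpy as np

def grow_spanning_tree(root, bounds, obstacles, bubble_radius=0.3,
                       step=0.5, n_iters=500, seed=0):
    """Grow a simplified RRT outward from an object centroid (root).

    bounds:    ((xmin, xmax), (ymin, ymax)) in the horizontal plane
    obstacles: (M, 2) sparse point-cloud points; candidate nodes closer
               than bubble_radius to any of them are rejected
    Returns nodes as a (K, 2) array and a parent-index list (root = -1).
    """
    rng = np.random.default_rng(seed)
    nodes = [np.asarray(root, dtype=float)]
    parents = [-1]
    (xmin, xmax), (ymin, ymax) = bounds
    for _ in range(n_iters):
        # sample a random point, steer from the nearest node toward it
        sample = np.array([rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)])
        dists = [np.linalg.norm(sample - n) for n in nodes]
        nearest = int(np.argmin(dists))
        direction = sample - nodes[nearest]
        norm = np.linalg.norm(direction)
        if norm < 1e-9:
            continue
        new = nodes[nearest] + step * direction / norm
        # "bubble" collision buffer around sparse point-cloud obstacles
        if len(obstacles) and np.min(
                np.linalg.norm(obstacles - new, axis=1)) < bubble_radius:
            continue
        nodes.append(new)
        parents.append(nearest)
    return np.array(nodes), parents
```

Because growth starts at the object and samples cover the whole room, the leaves end up scattered across many possible drone starting regions, which is exactly the “spatially spanning” property.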

Pipeline:

  1. Find target object in 3D
  2. Grow RRT* tree through empty space
  3. Avoid obstacles using “bubbles”
    1. Place collision buffers around points in the sparse point cloud and forbid the tree from entering those regions
  4. Bias the final approach direction
  5. Clean up the tree
  6. Add motion and viewing behavior
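After the tree is grown and cleaned up, each leaf defines one candidate trajectory: walk the parent pointers from that leaf back to the root at the object. A minimal sketch, assuming the tree is stored as a parent-index array (a hypothetical representation; the helper name is made up):

```python
def leaf_to_root_path(parents, leaf):
    """Walk parent pointers from a leaf node back to the tree root.

    parents[i] is the parent index of node i; the root has parent -1.
    Returns node indices ordered leaf -> root (reverse for root -> leaf).
    """
    path = [leaf]
    while parents[path[-1]] != -1:
        path.append(parents[path[-1]])
    return path
```

Reversing the returned list gives the object-to-leaf direction; the data-synthesis section below flies these paths leaf-to-root (i.e. toward the object).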

This is a rough plan and should not be used until it has been passed through an MPC expert
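To give a flavor of that refinement step: an MPC expert repeatedly solves a short-horizon tracking problem over the drone’s dynamics. The sketch below is a stand-in, not the paper’s controller — it uses an unconstrained finite-horizon LQR (Riccati recursion) on a toy 1-D double integrator, whereas the real expert handles full quadrotor dynamics and constraints; all weights and the model are assumptions.

```python
import numpy as np

def tracking_gains(A, B, Q, R, horizon):
    """Finite-horizon Riccati recursion for u = -K x tracking gains.

    Returns one gain matrix per step, ordered for forward simulation.
    """
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

# toy 1-D double integrator: state = [position error, velocity error]
dt = 0.05  # 20 Hz step, matching the logging rate in the notes
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Ks = tracking_gains(A, B, np.eye(2), np.array([[0.1]]), horizon=40)

x = np.array([1.0, 0.0])        # start 1 m off the reference path
for K in Ks:
    x = A @ x - B @ (K @ x)     # apply u = -K x at each step
```

The point of the sketch: the raw RRT* polyline is only a geometric reference; the controller is what turns it into dynamically feasible states and inputs, which are then logged as training data.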

Policy Expert Data Synthesis

  • MPC expert flies the RRT* trajectories from leaf-to-root (or inverted)
  • At each time step k, they render an RGB image from the 3D Gaussian Splatting scene
    • Training is not done on the raw RGB; the image is passed through CLIPSeg to get a semantic representation
  • Also saved
    • state: drone pose, velocity, etc.
    • control inputs: thrust and attitude commands
  • Everything is saved at 20 Hz, so 20 samples / second
  • Simulation is intentionally perturbed: they randomize drone mass and thrust coefficient, and apply pose and velocity perturbations every 2 seconds, each by up to 30% relative to the real drone parameters
  • They chop long simulated flight trajectories into 2-second chunks
    • prevents overfitting to one long specific trajectory
    • frames learning as “react correctly from the current situation”
  • Instead of feeding raw camera images to the policy, they use CLIPSeg to create a semantic relevance map for the user’s query
    • They color this map with a 3-channel colormap
      • bright red/orange = highly relevant to the query
      • yellow / green = moderately relevant
      • blue = irrelevant
    • This helps because RGB varies a lot across environments, while semantic maps are much more stable
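A possible version of that 3-channel coloring. The notes only specify blue = irrelevant, yellow/green = moderate, red/orange = high, so the exact colormap here (a linear blue → green → red blend) and the function name are assumptions, not the paper’s implementation:

```python
import numpy as np

def relevance_to_rgb(rel):
    """Map a CLIPSeg relevance map in [0, 1] to a 3-channel RGB image.

    Hypothetical colormap: blue at 0 (irrelevant), green at 0.5
    (moderately relevant), red at 1 (highly relevant).
    """
    rel = np.clip(rel, 0.0, 1.0)
    r = np.clip(2.0 * rel - 1.0, 0.0, 1.0)   # ramps up in the top half
    g = 1.0 - np.abs(2.0 * rel - 1.0)        # peaks at mid relevance
    b = np.clip(1.0 - 2.0 * rel, 0.0, 1.0)   # ramps down in the bottom half
    return np.stack([r, g, b], axis=-1)
```

Applied to an H×W relevance map, this yields an H×W×3 image the policy can consume in place of raw RGB.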
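The 2-second chunking mentioned above is also simple to sketch. At 20 Hz, a 2-second window is 40 samples; dropping a trailing partial chunk is an assumption here (the paper may pad or overlap windows instead), and the function name is made up:

```python
def chunk_trajectory(samples, chunk_len=40):
    """Split one logged trajectory into fixed-length training chunks.

    At 20 Hz, chunk_len=40 gives the 2-second windows from the notes.
    A trailing partial chunk is dropped (assumption).
    """
    return [samples[i:i + chunk_len]
            for i in range(0, len(samples) - chunk_len + 1, chunk_len)]
```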