Paper Link: https://arxiv.org/abs/2510.26742v1

Achieved latency of 27.3 ms given 2 input views

Eliminating the CPU overhead

Neural network inference is driven by Python code that launches the CUDA kernels. The Python side adds significant overhead when the number of kernels is large (1000+ in pi0)
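To get a feel for why the kernel count matters, here is a rough back-of-the-envelope sketch. The kernel count (1000) and the 27.3 ms latency come from these notes; the per-launch CPU costs are hypothetical values chosen only for illustration, not measurements from the paper, and the helper name `launch_overhead_ms` is made up for this example.

```python
def launch_overhead_ms(num_kernels: int, per_launch_us: float) -> float:
    """Total CPU-side launch overhead in milliseconds.

    Assumes every kernel launch costs the same fixed amount of CPU time
    and that launches are issued one after another from Python.
    """
    return num_kernels * per_launch_us / 1000.0


NUM_KERNELS = 1000   # lower bound from the notes ("1000+")
LATENCY_MS = 27.3    # latency figure from the notes (2 input views)

# ASSUMPTION: hypothetical per-launch CPU costs in microseconds.
ASSUMED_PER_LAUNCH_US = (2.0, 5.0, 10.0)

for per_launch_us in ASSUMED_PER_LAUNCH_US:
    overhead_ms = launch_overhead_ms(NUM_KERNELS, per_launch_us)
    share = overhead_ms / LATENCY_MS
    print(
        f"{per_launch_us:5.1f} us/launch -> {overhead_ms:5.1f} ms "
        f"({share:.0%} of {LATENCY_MS} ms)"
    )
```

Under these assumed costs, even a few microseconds per launch add up to milliseconds across 1000+ kernels, which is a noticeable fraction of a latency in the tens of milliseconds.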

There are several Ahead-Of-Time (AOT) and Just-In-Time (JIT) compilation techniques available