Paper Link: https://arxiv.org/abs/2510.26742v1
Achieved latency of 27.3 ms given 2 input views
Eliminating the CPU overhead
Neural network inference is driven by Python code that launches the CUDA kernels. The Python side adds significant overhead when the number of kernels is large (1000+ in pi 0).
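A rough sketch of why the kernel count matters, assuming each launch carries a fixed CPU-side cost. The 10 microsecond per-launch figure is a placeholder assumption, not a number from the paper, and `fake_launch` is a hypothetical stand-in for a real kernel launch. A real framework launch does more work per call than a no-op function, so the measured value here is only a loose lower bound.

```python
import time


def estimate_launch_overhead_ms(num_kernels, per_launch_overhead_us):
    """Total CPU-side overhead in ms, assuming a fixed cost per launch."""
    return num_kernels * per_launch_overhead_us / 1000.0


def measure_python_dispatch_us(num_calls=1000, repeats=5):
    """Time a no-op Python call, returning the best per-call cost in microseconds.

    fake_launch is a hypothetical stand-in: it does no GPU work at all,
    so this only captures bare interpreter call overhead.
    """
    def fake_launch():
        pass

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(num_calls):
            fake_launch()
        best = min(best, time.perf_counter() - start)
    return best / num_calls * 1e6


if __name__ == "__main__":
    # Assumed per-launch cost of 10 us (placeholder, not from the paper).
    total_ms = estimate_launch_overhead_ms(num_kernels=1000, per_launch_overhead_us=10.0)
    print(f"Estimated overhead for 1000 launches: {total_ms:.1f} ms")
    print(f"Measured no-op call cost: {measure_python_dispatch_us():.3f} us per call")
```

Under that assumed per-launch cost, 1000 launches would add about 10 ms of CPU time, which is large next to a 27.3 ms end-to-end latency. This is why the launch path is worth removing.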
There are several Ahead-Of-Time (AOT) and Just-In-Time (JIT) compilation techniques available