Blog Link: https://pytorch.org/blog/accelerating-generative-ai-2/
In a large-scale deep learning system, the GPU does essentially 100% of the computation; the CPU just tells the GPU what work to do.
An issue arises when the GPU finishes its current chunk of work long before the CPU has issued the next operation: the GPU then sits idle, and the workload is bound by CPU overhead.
To solve this, you can send one massive chunk of work at a time so the GPU stays busy long enough for the CPU to get ahead. During training this is easily done by increasing the batch size, but how do you do it during inference?
Utilize torch.compile
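The overhead problem above can be illustrated with a rough back-of-the-envelope model. This is only a sketch: the function name and all timings are made up for illustration, and it assumes GPU time grows linearly with batch size, which real kernels only approximate.

```python
def gpu_utilization(cpu_launch_us, gpu_us_per_sample, batch_size):
    """Fraction of each step the GPU spends busy (toy model, hypothetical numbers)."""
    gpu_us = gpu_us_per_sample * batch_size
    # The CPU queues work asynchronously, so a step takes as long as the slower side.
    step_us = max(cpu_launch_us, gpu_us)
    return gpu_us / step_us


# Same per-op CPU cost, different amounts of GPU work per op.
small = gpu_utilization(cpu_launch_us=10.0, gpu_us_per_sample=1.0, batch_size=1)
large = gpu_utilization(cpu_launch_us=10.0, gpu_us_per_sample=1.0, batch_size=64)
print(small, large)
```

With a small batch the GPU is idle most of the time in this model, while a large batch hides the CPU cost entirely. Inference often cannot grow the batch, so the alternative is to shrink the CPU cost per op.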
Reduce CPU overhead through torch.compile and static kv-cache
torch.compile captures a larger region of the model into a single compiled region; when run with mode="reduce-overhead" it is very effective at reducing CPU overhead.
To apply it, wrap a module (or function) with torch.compile.
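A minimal sketch of wrapping a module might look like the following. TinyMLP and run_once are hypothetical stand-ins, not code from the blog post, and the sketch assumes a PyTorch 2.x install. Actually running the compiled module needs a working compiler backend, and the overhead reduction is aimed at CUDA devices.

```python
import torch
import torch.nn as nn


class TinyMLP(nn.Module):
    """Hypothetical stand-in for a real model."""

    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)


model = TinyMLP().eval()

# Wrapping is lazy: nothing is compiled until the first call.
compiled_model = torch.compile(model, mode="reduce-overhead")


def run_once(device="cuda", dim=64):
    """Run one forward pass through the compiled module on the given device."""
    compiled_model.to(device)
    x = torch.randn(8, dim, device=device)
    with torch.no_grad():
        # Clone so the result does not alias memory that later runs may reuse.
        return compiled_model(x).clone()
```

The first call to run_once() triggers compilation and is therefore slow; later calls with the same input shapes reuse the compiled region.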