Blog Link: https://pytorch.org/blog/accelerating-generative-ai-2/

In a large-scale deep learning system, the GPU is responsible for doing 100% of the work; the CPU just tells the GPU what work it should be doing

An issue arises when, by the time the CPU tells the GPU to do the next operation, the GPU has long since finished the previous chunk of work and has been sitting idle waiting for instructions

To solve this, you could send a massive chunk of work at once to keep the GPU busy for long enough. During training this is easily accomplished by increasing the batch size, but how do you do it during inference?

Use torch.compile

Reduce CPU overhead through torch.compile and static kv-cache

torch.compile captures a large region of the model into a single compiled region; when run with mode="reduce-overhead", it is very effective at reducing CPU overhead

To apply it, wrap a module in a call to torch.compile
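
A minimal sketch of what wrapping a module looks like. TinyMLP, its layer sizes, and the run_once helper are hypothetical stand-ins for illustration and do not come from the blog post; only the torch.compile call with mode="reduce-overhead" reflects the notes above. The sketch falls back to CPU if no GPU is present, although the overhead reduction is aimed at GPU execution.

```python
import torch
import torch.nn as nn


class TinyMLP(nn.Module):
    """Hypothetical stand-in for a real model (not from the blog post)."""

    def __init__(self, dim: int = 64) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Assumption: use the GPU when one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyMLP().to(device).eval()

# Wrap the module. This call is lazy: the actual compilation is triggered
# by the first forward pass, so the first call is slow and later calls are fast.
compiled_model = torch.compile(model, mode="reduce-overhead")


def run_once(batch_size: int = 1) -> torch.Tensor:
    """Run one inference step through the compiled module."""
    x = torch.randn(batch_size, 64, device=device)
    with torch.no_grad():
        return compiled_model(x)
```

The compiled module is a drop-in replacement for the original one, so existing inference code can keep calling it the same way. Expect the first call (or first few calls) to be noticeably slower while compilation runs.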