vLLM stands for Virtual Large Language Model and is an open-source library that supports efficient LLM inference and model serving.

Article Link: https://www.hopsworks.ai/dictionary/vllm

Background

vLLM was first introduced in the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" by Kwon et al. The paper identified several issues with LLM serving, the main one being inefficient memory allocation, which can negatively impact performance.

The paper specifically emphasizes the inefficiency of Key-Value (KV) cache memory management in current LLM serving systems, which can result in slow inference and a high memory footprint.

Solution

To fix this issue, the paper presents PagedAttention, an algorithm inspired by the virtual memory and paging techniques commonly used in operating systems.

PagedAttention enables efficient memory management by storing attention keys and values non-contiguously, i.e., in blocks that do not need to be adjacent or sequentially connected in memory.
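
As a rough illustration of the idea (not vLLM's actual implementation), the sketch below keeps each sequence's KV cache in fixed-size blocks drawn from a shared pool, with a per-sequence block table translating logical token positions to physical blocks. The block size, pool size, and function names are assumptions made for illustration.

```python
# Toy sketch of block-table-based KV cache management (illustrative only).
BLOCK_SIZE = 16          # tokens per KV block (assumed value)
NUM_BLOCKS = 1024        # size of the shared physical block pool (assumed value)

free_blocks = list(range(NUM_BLOCKS))   # physical blocks available for allocation
block_tables = {}                       # sequence id -> list of physical block ids

def append_token(seq_id: int, num_tokens_so_far: int) -> None:
    """Allocate a new physical block only when the current block is full."""
    table = block_tables.setdefault(seq_id, [])
    if num_tokens_so_far % BLOCK_SIZE == 0:   # current block full (or first token)
        table.append(free_blocks.pop())       # grab any free block; no contiguity needed

def physical_slot(seq_id: int, token_pos: int) -> tuple[int, int]:
    """Translate a token's logical position to (physical block, offset) via the block table."""
    block_id = block_tables[seq_id][token_pos // BLOCK_SIZE]
    return block_id, token_pos % BLOCK_SIZE
```

Because blocks are allocated on demand and can live anywhere in the pool, almost no KV cache memory is reserved but left unused.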

Following this idea, the paper develops vLLM, an LLM serving engine built on PagedAttention. vLLM achieves near-zero waste in KV cache memory.
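
In practice, vLLM exposes this engine through a simple Python API. The snippet below is a minimal offline-inference sketch following vLLM's documented quickstart; the model name and sampling settings are placeholders, and exact parameter names can vary between vLLM versions.

```python
from vllm import LLM, SamplingParams

# Load a model (placeholder model name) and define sampling settings.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Generate completions; vLLM batches and schedules the requests internally.
outputs = llm.generate(["What is PagedAttention?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```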

Other techniques used

vLLM also uses a number of other techniques to optimize LLM serving:

  • Continuous Batching: Incoming requests are continuously batched together to maximize hardware utilization and reduce wasted compute by minimizing idle time (a toy sketch of this idea follows this list).
  • Quantization: vLLM uses lower-precision formats such as FP16 to reduce memory usage.
  • Optimized CUDA Kernels: vLLM hand-tunes the code executed on the GPU for maximum performance.
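
The continuous batching idea can be illustrated with a small scheduler loop (a toy sketch, not vLLM's actual scheduler): finished requests leave the batch after every decoding iteration and waiting requests are admitted immediately, so the batch stays full. The request lengths and batch limit below are made-up values.

```python
import collections

# Toy sketch of continuous (iteration-level) batching (illustrative only).
waiting = collections.deque([["req-1", 3], ["req-2", 5], ["req-3", 2]])  # [id, tokens left]
running = []
MAX_BATCH = 2   # assumed batch capacity

while waiting or running:
    # Admit new requests whenever there is room, instead of waiting for the batch to drain.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())

    # One decoding iteration: every running request generates one token.
    for req in running:
        req[1] -= 1

    # Requests that finished this iteration free their slot right away.
    running = [req for req in running if req[1] > 0]
```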