NVIDIA TensorRT is an SDK for optimizing trained deep learning models to enable high-performance inference.

TensorRT operates in two phases. In the first (build) phase, you provide TensorRT with a model definition and TensorRT optimizes it for a target GPU. In the second (runtime) phase, you use the optimized model to run inference.

NVIDIA Docs: https://docs.nvidia.com/deeplearning/tensorrt/latest/architecture/capabilities.html

Build Phase

The Builder is responsible for optimizing a model and producing an Engine.

To build an engine:

  1. Create a network definition
  2. Specify a config for the builder
  3. Call the builder to create the engine
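The three steps above can be sketched with the TensorRT Python API. This is a minimal sketch, not a reference implementation: it assumes the model is available as an ONNX file, that the `tensorrt` package (8.5 or later) is installed, and the function name and file paths are hypothetical.

```python
def build_serialized_engine(onnx_path: str, plan_path: str) -> None:
    """Build a TensorRT plan from an ONNX file (illustrative sketch)."""
    # Imported lazily so this module still loads on machines without TensorRT.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # 1. Create a network definition and populate it from the ONNX model.
    #    The explicit-batch flag is needed on TensorRT 8.x; on 10.x it is
    #    deprecated and ignored because explicit batch is the only mode.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    # 2. Specify a config for the builder (all defaults in this sketch).
    config = builder.create_builder_config()

    # 3. Call the builder; the result is a serialized engine (a "plan").
    plan = builder.build_serialized_network(network, config)
    if plan is None:
        raise RuntimeError("engine build failed")
    with open(plan_path, "wb") as f:
        f.write(plan)
```

A call such as `build_serialized_engine("model.onnx", "model.plan")` would write the plan to disk so the runtime phase can load it later.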

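The Network Definition interface defines the model. The most common way to transfer a model to TensorRT is to export it from a framework in ONNX format and use TensorRT's ONNX parser to populate the network definition.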

The Builder Config interface specifies how TensorRT should optimize the model. You can control TensorRT's ability to reduce the precision of calculations, control the tradeoff between memory use and runtime execution speed, and constrain the choice of CUDA kernels. Because a build can take minutes or more, you can also control how the builder searches for kernels and cache the search results for later builds.
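A few of these knobs, sketched with the TensorRT Python API. The helper name, the 1 GiB workspace default, and the choice of FP16 are assumptions for illustration only; which flags are appropriate depends on the model and the TensorRT version (8.5 or later is assumed here).

```python
def make_builder_config(builder, workspace_bytes: int = 1 << 30, allow_fp16: bool = True):
    """Create a builder config with a few common options set (sketch)."""
    # Imported lazily so this module still loads on machines without TensorRT.
    import tensorrt as trt

    config = builder.create_builder_config()

    # Memory vs. speed: cap the scratch memory that kernels may use.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_bytes)

    # Precision: allow (not force) the builder to pick FP16 kernels.
    if allow_fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    # Kernel search results: start from an empty timing cache. A real
    # workflow would load previously serialized cache bytes instead of b"".
    cache = config.create_timing_cache(b"")
    config.set_timing_cache(cache, ignore_mismatch=False)

    return config
```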

The Builder eliminates dead computations, folds constants, and reorders and combines operations so they run more efficiently on the GPU.
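To make two of these terms concrete, here is a toy illustration of constant folding and dead-computation elimination on a tiny graph. It is not TensorRT's implementation; the graph format and function names are invented for this sketch.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}

# Toy graph: node name -> (op, payload). "const" carries a value,
# "input" is a runtime input, other ops carry a list of input names.
graph = {
    "x": ("input", []),
    "a": ("const", 2.0),
    "b": ("const", 3.0),
    "c": ("mul", ["a", "b"]),       # foldable: both inputs are constants
    "y": ("add", ["x", "c"]),       # depends on the runtime input
    "unused": ("mul", ["x", "a"]),  # dead: never reaches the output
}

def fold_constants(graph):
    """Replace ops whose inputs are all constants with one const node."""
    folded = dict(graph)
    changed = True
    while changed:
        changed = False
        for name, (op, args) in list(folded.items()):
            if op in OPS and all(folded[a][0] == "const" for a in args):
                values = [folded[a][1] for a in args]
                folded[name] = ("const", OPS[op](*values))
                changed = True
    return folded

def eliminate_dead(graph, outputs):
    """Keep only the nodes reachable from the requested outputs."""
    live, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name in live:
            continue
        live.add(name)
        op, args = graph[name]
        if op in OPS:
            stack.extend(args)
    return {name: node for name, node in graph.items() if name in live}

optimized = eliminate_dead(fold_constants(graph), outputs=["y"])
```

After both passes only `x`, `c`, and `y` remain: `c` has become a single constant, and `a`, `b`, and `unused` are gone because nothing on the path to `y` needs them.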

The Builder can also optionally reduce the precision of floating-point computations, either by running them in 16-bit floating point or by quantizing floating-point values so that calculations can be performed using 8-bit integers.
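The arithmetic behind both options can be shown with the standard library alone. This is a sketch of the general idea (FP16 rounding and symmetric per-tensor INT8 quantization), not of how TensorRT selects scales; the sample values and the max-abs scale rule are assumptions for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_int8(values, scale):
    """Symmetric quantization: q = clamp(round(x / scale), -127, 127)."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize_int8(qvalues, scale):
    """Map 8-bit integers back to approximate floating-point values."""
    return [q * scale for q in qvalues]

activations = [0.1, -1.5, 2.54, 0.0]

# Hypothetical per-tensor scale: map the largest magnitude to 127.
scale = max(abs(v) for v in activations) / 127

q = quantize_int8(activations, scale)
restored = dequantize_int8(q, scale)
fp16_values = [to_fp16(v) for v in activations]
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy given up in exchange for integer arithmetic.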

Runtime Phase

The Runtime is the highest-level interface for the execution phase.

When using the runtime, you will typically carry out the following steps:

  1. Deserialize a plan to create an engine
  2. Create an execution context from the engine
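These two steps might look as follows with the TensorRT Python API. A minimal sketch: the function name and plan path are hypothetical, and the plan is assumed to have been built for the same GPU and TensorRT version that loads it.

```python
def load_engine(plan_path: str):
    """Deserialize a plan and create an execution context (sketch)."""
    # Imported lazily so this module still loads on machines without TensorRT.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)

    # 1. Deserialize a plan to create an engine.
    with open(plan_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    if engine is None:
        raise RuntimeError("could not deserialize plan: " + plan_path)

    # 2. Create an execution context from the engine.
    context = engine.create_execution_context()

    # Return the runtime too, as a precaution, so it stays alive
    # for as long as the engine and context are in use.
    return runtime, engine, context
```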

The Engine interface represents an optimized model. You can query an engine for information about the input and output tensors of the network.
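A sketch of such a query using the name-based tensor API available in TensorRT 8.5 and later. The helper name and the dictionary layout are made up for illustration; `engine` is assumed to be a deserialized `ICudaEngine`.

```python
def describe_io(engine):
    """Return name, mode, shape, and dtype for each I/O tensor of an engine."""
    info = []
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        info.append({
            "name": name,
            # Enum member names, e.g. "INPUT" / "OUTPUT" and "FLOAT" / "HALF".
            "mode": engine.get_tensor_mode(name).name,
            "dtype": engine.get_tensor_dtype(name).name,
            # Dimensions as a plain tuple; -1 marks a dynamic dimension.
            "shape": tuple(engine.get_tensor_shape(name)),
        })
    return info
```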

The Execution Context, created from the engine, is the main interface for invoking inference.

When invoking inference, you must set up the input and output buffers in the appropriate locations. Depending on the data, this can be in either CPU or GPU memory.
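A sketch of the GPU-memory case with the TensorRT 8.5+ Python API. It assumes the buffers were already allocated and filled by some CUDA binding library; `device_ptrs` (a dict from tensor name to integer device address) and `stream_handle` are hypothetical inputs supplied by the caller.

```python
def bind_and_run(engine, context, device_ptrs, stream_handle: int) -> None:
    """Attach pre-allocated buffers to a context and enqueue inference."""
    # Tell the context where every input and output tensor lives.
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        if name not in device_ptrs:
            raise KeyError("no buffer provided for tensor " + name)
        if not context.set_tensor_address(name, device_ptrs[name]):
            raise RuntimeError("could not set address for tensor " + name)

    # Enqueue the work on the given CUDA stream. The call is asynchronous,
    # so the caller must synchronize the stream before reading outputs.
    if not context.execute_async_v3(stream_handle):
        raise RuntimeError("inference enqueue failed")
```

Keeping allocation outside the function leaves the choice of CUDA library, and the reuse of buffers across calls, to the caller.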