LLM · Free · Open Source

TENSORRT-LLM

High-performance LLM inference on NVIDIA GPUs

Apache-2.0

ABOUT

Running large language model inference at production scale is prohibitively slow and expensive on generic runtimes. TensorRT-LLM solves this by providing a highly optimized inference engine with specialized CUDA kernels, advanced parallelism strategies, quantization support, and low-latency scheduling to maximize throughput and minimize latency on NVIDIA hardware.

INSTALL
pip install tensorrt_llm
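
A minimal usage sketch with the high-level Python LLM API. The model id is a placeholder, and the class and argument names (LLM, SamplingParams, max_tokens) follow recent TensorRT-LLM releases; older versions may differ.

# Minimal sketch: generate text with the high-level Python LLM API.
# The model id below is a placeholder checkpoint.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # builds/loads an optimized engine
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["Explain speculative decoding in one sentence."], params)
print(outputs[0].outputs[0].text)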

INTEGRATION GUIDE

1. High-throughput LLM serving in data centers and cloud environments
2. Deploying quantized models (FP8, FP4, INT8, AWQ) for reduced memory and latency
3. Multi-GPU and multi-node distributed inference with Tensor/Pipeline/Expert parallelism (see the sketch after this list)
4. Accelerating visual generation inference (diffusion models)
5. Production serving integrated with Triton Inference Server and NVIDIA Dynamo
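
A hedged sketch of multi-GPU tensor-parallel inference with the same LLM API. The tensor_parallel_size argument and the checkpoint name are assumptions based on recent releases; adjust both to your hardware and model.

# Sketch: shard one model across 2 GPUs via tensor parallelism.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    tensor_parallel_size=2,                    # shard weights across 2 GPUs
)

outputs = llm.generate(
    ["Summarize tensor parallelism in two sentences."],
    SamplingParams(max_tokens=48),
)
print(outputs[0].outputs[0].text)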

TAGS

inference · nvidia · gpu · cuda · quantization · speculative-decoding · distributed-inference · triton · visual-generation