LLM · Free · Open Source

TENSORRT-LLM

High-performance LLM inference on NVIDIA GPUs

Apache-2.0

ABOUT

Running large language model inference at production scale is prohibitively slow and expensive on generic runtimes. TensorRT-LLM solves this by providing a highly optimized inference engine with specialized CUDA kernels, advanced parallelism strategies, quantization support, and low-latency scheduling to maximize throughput and minimize latency on NVIDIA hardware.

INSTALL
pip install tensorrt_llm
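
A minimal usage sketch with the high-level Python LLM API. The model id is a placeholder, and the class and argument names (LLM, SamplingParams, max_tokens) follow recent TensorRT-LLM releases; older versions may differ.

# Minimal sketch: generate text with the high-level Python LLM API.
# The model id below is a placeholder checkpoint.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # builds/loads an optimized engine
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["Explain speculative decoding in one sentence."], params)
print(outputs[0].outputs[0].text)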

INTEGRATION GUIDE

1. High-throughput LLM serving in data centers and cloud environments
2. Deploying quantized models (FP8, FP4, INT8, AWQ) for reduced memory and latency
3. Multi-GPU and multi-node distributed inference with Tensor/Pipeline/Expert parallelism (see the sketch after this list)
4. Accelerating visual generation inference (diffusion models)
5. Production serving integrated with Triton Inference Server and NVIDIA Dynamo
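
A hedged sketch of multi-GPU tensor-parallel inference with the same LLM API. The tensor_parallel_size argument and the checkpoint name are assumptions based on recent releases; adjust both to your hardware and model.

# Sketch: shard one model across 2 GPUs via tensor parallelism.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    tensor_parallel_size=2,                    # shard weights across 2 GPUs
)

outputs = llm.generate(
    ["Summarize tensor parallelism in two sentences."],
    SamplingParams(max_tokens=48),
)
print(outputs[0].outputs[0].text)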

TAGS

inference · nvidia · gpu · cuda · quantization · speculative-decoding · distributed-inference · triton · visual-generation