LLM · Free · Open Source
TENSORRT-LLM
High-performance LLM inference on NVIDIA GPUs
Apache-2.0
ABOUT
Running large language model inference at production scale is prohibitively slow and expensive on generic runtimes. TensorRT-LLM solves this by providing a highly optimized inference engine with specialized CUDA kernels, advanced parallelism strategies, quantization support, and low-latency scheduling to maximize throughput and minimize latency on NVIDIA hardware.
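A minimal offline-generation sketch using the high-level LLM API bundled with TensorRT-LLM. The model name, prompts, and sampling values are placeholders, and the exact API surface can shift between releases, so treat this as an illustration rather than canonical usage.

from tensorrt_llm import LLM, SamplingParams

# Build or load an optimized engine for an example Hugging Face checkpoint (placeholder model).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = [
    "The capital of France is",
    "Low-latency LLM serving on GPUs requires",
]
# Example sampling settings; max_tokens bounds the generated length.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches the prompts and runs them through the optimized engine.
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)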
INSTALL
pip install tensorrt_llm
INTEGRATION GUIDE
1. High-throughput LLM serving in data centers and cloud environments
2. Deploying quantized models (FP8, FP4, INT8, AWQ) for reduced memory and latency
3. Multi-GPU and multi-node distributed inference with Tensor/Pipeline/Expert parallelism (see the sketch after this list)
4. Visual generation inference (diffusion models) acceleration
5. Production serving integrated with Triton Inference Server and NVIDIA Dynamo
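As referenced in items 2 and 3 above, the sketch below shows how quantization and tensor parallelism can be requested through the same LLM API. The QuantConfig/QuantAlgo import path, the available algorithm names, and the tensor_parallel_size argument are assumptions based on recent releases; check them against the installed version.

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo  # import path assumed; may vary by version

# Request FP8 quantization (other algorithms, e.g. INT8 or AWQ variants, are selected the same way).
quant = QuantConfig(quant_algo=QuantAlgo.FP8)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    quant_config=quant,
    tensor_parallel_size=2,  # shard each layer across 2 GPUs
)

out = llm.generate(["Explain tensor parallelism in one sentence."],
                   SamplingParams(max_tokens=48))
print(out[0].outputs[0].text)

For serving (item 5), the resulting engines are typically deployed behind the OpenAI-compatible trtllm-serve command or the Triton Inference Server TensorRT-LLM backend rather than called directly from application code.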
TAGS
inference, nvidia, gpu, cuda, quantization, speculative-decoding, distributed-inference, triton, visual-generation