LLM, Free, Open Source
LORAX
Serve thousands of fine-tuned LLMs on a single GPU
Apache-2.0
ABOUT
Serving fine-tuned LLMs at scale is prohibitively expensive because each fine-tuned variant typically requires its own GPU or instance, and a single customer-facing use case can require dozens of models with minor variations. LoRAX solves this by dynamically loading LoRA adapters per request on top of a shared base model and batching requests for different adapters together (heterogeneous continuous batching). Thousands of fine-tuned models can then share a single GPU, reducing serving costs by 10-100x.
INSTALL
docker run --gpus all -p 8080:80 ghcr.io/predibase/lorax:latest --model-id <BASE_MODEL>
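Once the container is running, requests can be sent to port 8080 on the host. The following is a minimal sketch using only the Python standard library. It assumes the server exposes a `/generate` endpoint that accepts an `inputs` string and a `parameters` object with an optional `adapter_id`, which is how the LoRAX README describes the REST API; check the current LoRAX documentation before relying on these names. The helper names `build_generate_request` and `generate` are illustrative, not part of LoRAX.

```python
import json
import urllib.request

# Assumption: the container started by the docker command is reachable here.
BASE_URL = "http://127.0.0.1:8080"


def build_generate_request(prompt, adapter_id=None, max_new_tokens=64, base_url=BASE_URL):
    """Build a POST request for the (assumed) /generate endpoint.

    If adapter_id is None, the field is omitted and the request is expected
    to run against the base model only.
    """
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    body = json.dumps({"inputs": prompt, "parameters": parameters}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, adapter_id=None, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_generate_request(prompt, adapter_id, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())
```

With a server running, `generate("What is LoRA?", adapter_id="my-org/my-adapter")` would send one request for that adapter; `my-org/my-adapter` is a placeholder, not a real adapter.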
INTEGRATION GUIDE
1. Serve thousands of customer-specific fine-tuned models from a single GPU with per-request adapter switching
2. Deploy a multi-tenant LLM API where each tenant brings their own custom fine-tuned model
3. Run batch inference over a large collection of fine-tuned adapters for evaluation or research
4. Merge multiple LoRA adapters per request to create powerful ensemble predictions
5. Enable A/B testing of fine-tuned model variants in production without provisioning separate infrastructure
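Because the adapter is chosen per request, the multi-tenant and A/B testing cases above mostly reduce to picking an adapter ID in application code before calling the server. The sketch below shows one way to do that; the tenant table, the adapter IDs, and all function names are hypothetical, and the `adapter_id` request parameter is assumed from the LoRAX README.

```python
import hashlib

# Hypothetical tenant -> adapter mapping; these adapter IDs are placeholders.
TENANT_ADAPTERS = {
    "acme": "acme/support-lora",
    "globex": "globex/support-lora",
}


def adapter_for_tenant(tenant_id):
    """Return the tenant's adapter ID, or None to fall back to the base model."""
    return TENANT_ADAPTERS.get(tenant_id)


def ab_variant(user_id, variant_a, variant_b, share_b=0.5):
    """Deterministically assign a user to one of two adapters.

    Hashing the user ID keeps each user on the same variant across requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return variant_b if bucket < share_b else variant_a


def generate_parameters(adapter_id, max_new_tokens=64):
    """Build the 'parameters' object for a request (assumed field names)."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    return parameters
```

Both variants in an A/B test are served by the same deployment, so changing the split is a matter of adjusting `share_b` rather than provisioning another instance.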
TAGS
llm, inference, serving, lora, fine-tuning, model-serving, open-source