LLM · Free · Open Source

LORAX

Serve thousands of fine-tuned LLMs on a single GPU

Apache-2.0

ABOUT

Serving fine-tuned LLMs at scale is prohibitively expensive because each fine-tuned variant typically requires its own GPU or instance, and a single customer-facing use case can require dozens of models with minor variations. LoRAX addresses this by keeping one shared base model on the GPU and dynamically loading LoRA adapters on top of it per request. Heterogeneous continuous batching lets requests that target different adapters run in the same batch, so thousands of fine-tuned models can share a single GPU, for a claimed 10-100x reduction in serving costs.
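To make the shared-base-model idea concrete, here is a toy sketch of a single linear layer serving a mixed batch. It relies only on the standard LoRA formulation (output = W·x + B·(A·x), with A and B small low-rank matrices per adapter). The names `lora_forward`, `matvec`, and the adapter IDs are hypothetical, and this is not how LoRAX implements batching internally; real systems use fused GPU kernels rather than a Python loop.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]  # row-major: matrix[row][col]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(
    base_w: Matrix,
    adapters: Dict[str, Tuple[Matrix, Matrix]],
    batch: List[Tuple[Optional[str], Vector]],
) -> List[Vector]:
    """Run one layer over a batch in which each row may use a different adapter.

    base_w is stored once and shared by every request. Each adapter is a
    pair (A, B) with A of shape (r x d_in) and B of shape (d_out x r).
    """
    outputs = []
    for adapter_id, x in batch:
        y = matvec(base_w, x)  # shared base-model computation
        if adapter_id is not None:
            a, b = adapters[adapter_id]
            delta = matvec(b, matvec(a, x))  # low-rank update B(Ax)
            y = [yi + di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy data: a 2x2 identity base weight and one rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
adapters = {"tenant-a": ([[1.0, 0.0]], [[0.0], [1.0]])}
mixed_batch = [("tenant-a", [2.0, 3.0]), (None, [2.0, 3.0])]
results = lora_forward(base, adapters, mixed_batch)
```

The point of the sketch is the memory layout: the large base weight exists once, while each additional fine-tuned variant adds only its small A and B matrices.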

INSTALL
docker run --gpus all -p 8080:80 ghcr.io/predibase/lorax:latest --model-id <BASE_MODEL>
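
Once the container is running, each request can name the adapter it wants. The following is a minimal sketch using only the Python standard library. It assumes the server exposes a `/generate` endpoint that accepts an `adapter_id` inside `parameters` and returns a JSON object with a `generated_text` field, as described in the LoRAX REST documentation; verify this against the version you deploy. The helper names and the adapter ID in the usage comment are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_generate_payload(
    prompt: str,
    adapter_id: Optional[str] = None,
    max_new_tokens: int = 64,
) -> Dict[str, Any]:
    """Build the JSON body for a generate request.

    Omitting adapter_id leaves the choice to the server, which is assumed
    to fall back to the base model.
    """
    parameters: Dict[str, Any] = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}


def generate(
    base_url: str,
    prompt: str,
    adapter_id: Optional[str] = None,
    timeout: float = 60.0,
) -> str:
    """POST the payload to <base_url>/generate and return the generated text."""
    payload = build_generate_payload(prompt, adapter_id)
    request = urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["generated_text"]  # assumed response field


# Usage (requires a running server on the port mapped above):
# print(generate("http://127.0.0.1:8080", "Hello", adapter_id="example-org/my-lora"))
```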

INTEGRATION GUIDE

1. Serve thousands of customer-specific fine-tuned models from a single GPU with per-request adapter switching
2. Deploy a multi-tenant LLM API where each tenant brings their own custom fine-tuned model
3. Run batch inference over a large collection of fine-tuned adapters for evaluation or research
4. Merge multiple LoRA adapters per request to create powerful ensemble predictions
5. Enable A/B testing of fine-tuned model variants in production without provisioning separate infrastructure
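
Because the adapter is chosen per request, multi-tenant routing and A/B testing (items 2 and 5 above) reduce to picking an adapter ID before the request is sent. The sketch below shows one possible way to do that on the client side. The tenant table, the adapter IDs, and the function names are all hypothetical, and the hash-based split is a generic technique rather than a LoRAX feature.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical mapping from tenant to that tenant's fine-tuned adapter.
TENANT_ADAPTERS: Dict[str, str] = {
    "acme": "example-org/acme-support-lora",
    "globex": "example-org/globex-legal-lora",
}


def adapter_for_tenant(tenant_id: str) -> Optional[str]:
    """Return the tenant's adapter ID, or None to use the base model."""
    return TENANT_ADAPTERS.get(tenant_id)


def ab_variant(user_id: str, adapter_a: str, adapter_b: str, share_b: float = 0.5) -> str:
    """Deterministically assign a user to adapter A or adapter B.

    Hashing the user ID keeps each user on the same variant across
    requests without storing any assignment state.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # integer in 0..9999
    threshold = round(share_b * 10_000)
    return adapter_b if bucket < threshold else adapter_a


chosen = ab_variant("user-123", "example-org/model-v1", "example-org/model-v2")
```

The returned ID would then be passed as the adapter for that request, so both variants are served by the same deployment.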

TAGS

llm · inference · serving · lora · fine-tuning · model-serving · open-source