Fine-tuning · Free · Open Source

KTRANSFORMERS

Run and fine-tune 100B+ MoE models on consumer GPUs via CPU-GPU heterogeneous computing

Apache-2.0

ABOUT

Ultra-large MoE models such as DeepSeek-V3/R1 normally require multiple high-end GPUs with 80GB+ of VRAM each, which puts them out of reach for individual developers. KTransformers addresses this with CPU-GPU hybrid execution: hot (frequently activated) experts are placed on the GPU and cold experts in CPU RAM, so 100B+ parameter models can run and be fine-tuned on a single consumer GPU (24GB VRAM) backed by system DRAM. It integrates with LLaMA-Factory for SFT/DPO, supports INT4/INT8 quantization, and reports a 3-28x speedup over pure offloading approaches.
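To make the hot/cold split concrete, here is a minimal sketch of one way such a placement could be decided: rank experts by how often the router selects them and keep the top few on the GPU. This is an illustration only, not KTransformers' actual policy or API; the function name `place_experts`, the `gpu_slots` budget, and the routing trace are all hypothetical.

```python
from collections import Counter

def place_experts(activation_counts, gpu_slots):
    """Split expert ids into (gpu_ids, cpu_ids) by activation frequency.

    Hypothetical helper: the `gpu_slots` most frequently routed experts
    are treated as "hot" and assigned to the GPU; the rest stay in CPU RAM.
    Ties are broken by the lower expert id.
    """
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    gpu_ids = set(ranked[:gpu_slots])
    cpu_ids = set(ranked[gpu_slots:])
    return gpu_ids, cpu_ids

# Made-up routing trace: the expert id chosen for each of ten tokens.
trace = [0, 3, 3, 1, 3, 0, 2, 3, 0, 5]

# Start every expert (0..5) at zero so unused experts are still placed.
counts = Counter({expert: 0 for expert in range(6)})
counts.update(trace)

gpu_ids, cpu_ids = place_experts(counts, gpu_slots=2)
print("GPU experts:", sorted(gpu_ids))
print("CPU experts:", sorted(cpu_ids))
```

A real system would also have to account for per-expert memory size, transfer cost, and how routing statistics shift between workloads; the sketch ignores all of these.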

INSTALL
pip install ktransformers

INTEGRATION GUIDE

1. Fine-tune DeepSeek-V3/R1 on a single RTX 4090 using CPU-GPU heterogeneous MoE expert offloading
2. Run production inference on 200B+ MoE models locally by routing hot experts to GPU and cold experts to system RAM
3. Fine-tune large MoE models with LoRA/QLoRA via LLaMA-Factory integration using INT8/INT4 quantized CPU kernels
4. Deploy multi-concurrent inference serving for Kimi-K2, DeepSeek, and Qwen3-MoE on commodity hardware
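As a rough sanity check on why quantization and CPU RAM matter for the use cases above, the following sketch estimates raw weight storage from parameter count and bit width. It is back-of-the-envelope arithmetic, not a KTransformers utility: the helper `weight_memory_gb` is hypothetical, 1 GB is taken as 10^9 bytes, and the estimate ignores quantization scales, KV cache, activations, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9

params = 100e9  # a 100B-parameter model, the lower bound named in this listing

for label, bits in [("16-bit", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions, even at INT4 a 100B-parameter model needs about 50 GB for weights alone, more than a 24GB consumer GPU holds, which is why most experts have to live in system DRAM.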

TAGS

python, llm, fine-tuning, inference, moe, quantization, deepseek, consumer-gpu