Fine-tuning | Free | Open Source

KTRANSFORMERS

Run and fine-tune 100B+ MoE models on consumer GPUs via CPU-GPU heterogeneous computing

Apache-2.0

ABOUT

Ultra-large MoE models such as DeepSeek-V3/R1 normally require multiple high-end GPUs with 80GB+ of VRAM each, which puts them out of reach for individual developers. KTransformers addresses this with CPU-GPU hybrid execution: hot (frequently activated) experts are placed on the GPU and cold experts in CPU RAM, so 100B+ parameter models can be run and fine-tuned on a single consumer GPU (24GB VRAM) backed by system DRAM. It integrates with LLaMA-Factory for SFT/DPO, supports INT4/INT8 quantization, and reports a 3-28x speedup over pure offloading approaches.
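The hot/cold split described above can be illustrated with a small sketch. This is not KTransformers' API or its actual placement policy; the function name `place_experts`, the `gpu_slots` budget, and the activation counts are all hypothetical, chosen only to show the idea of keeping the most frequently activated experts on the GPU.

```python
from typing import Dict, List, Tuple


def place_experts(
    activation_counts: Dict[int, int], gpu_slots: int
) -> Tuple[List[int], List[int]]:
    """Split expert ids into (gpu, cpu) lists by activation frequency.

    Hypothetical illustration only: the `gpu_slots` most frequently
    activated ("hot") experts go to the GPU, the rest ("cold") stay
    in CPU RAM.
    """
    if gpu_slots < 0:
        raise ValueError("gpu_slots must be non-negative")
    # Rank by activation count (descending); break ties by expert id.
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    hot = sorted(ranked[:gpu_slots])
    cold = sorted(ranked[gpu_slots:])
    return hot, cold


# Made-up activation counts for six experts, with room for two on the GPU.
counts = {0: 120, 1: 5, 2: 87, 3: 2, 4: 64, 5: 9}
gpu_experts, cpu_experts = place_experts(counts, gpu_slots=2)
print("GPU:", gpu_experts)  # GPU: [0, 2]
print("CPU:", cpu_experts)  # CPU: [1, 3, 4, 5]
```

A real system would also have to account for per-layer expert sets, transfer costs, and changing workloads; the sketch only captures the ranking step.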

INSTALL

pip install ktransformers

INTEGRATION GUIDE

1. Fine-tune DeepSeek-V3/R1 on a single RTX 4090 using CPU-GPU heterogeneous MoE expert offloading
2. Run production inference on 200B+ MoE models locally by routing hot experts to GPU and cold experts to system RAM
3. Fine-tune large MoE models with LoRA/QLoRA via LLaMA-Factory integration using INT8/INT4 quantized CPU kernels
4. Deploy multi-concurrent inference serving for Kimi-K2, DeepSeek, and Qwen3-MoE on commodity hardware
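To see why these scenarios depend on system RAM as well as quantization, a back-of-the-envelope estimate of weight storage may help. The helper below is a hypothetical sketch, not part of KTransformers; it counts raw weight bytes only and ignores quantization scales, activations, optimizer state, and KV cache.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Rough estimate only: raw weights, no quantization metadata,
    activations, optimizer state, or KV cache.
    """
    return num_params * bits_per_weight / 8 / 1e9


GPU_VRAM_GB = 24  # the single consumer GPU size mentioned in the listing

# A 100B-parameter model at 16-bit, INT8, and INT4 precision.
for bits in (16, 8, 4):
    size = weight_footprint_gb(100e9, bits)
    fits = "fits in" if size <= GPU_VRAM_GB else "exceeds"
    print(f"{bits:>2}-bit: {size:.0f} GB ({fits} {GPU_VRAM_GB} GB VRAM)")
```

Under these assumptions, even INT4 weights for a 100B-parameter model come to about 50 GB, roughly twice the VRAM of a 24GB card, which is why most expert weights have to live in CPU RAM.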
