Fine-tuning | Free | Open Source
KTRANSFORMERS
Run and fine-tune 100B+ MoE models on consumer GPUs via CPU-GPU heterogeneous computing
Apache-2.0
ABOUT
Ultra-large MoE models like DeepSeek-V3/R1 require multiple high-end GPUs with 80GB+ VRAM each, which puts them out of reach for individual developers. KTransformers addresses this with CPU-GPU hybrid execution: hot experts are placed on the GPU and cold experts in CPU RAM, so 100B+ parameter models can be run and fine-tuned on a single consumer GPU (24GB VRAM) backed by system DRAM. It integrates with LLaMA-Factory for SFT/DPO, supports INT4/INT8 quantization, and achieves a 3-28x speedup over pure offloading approaches.
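The hot/cold split described above can be illustrated with a minimal sketch. It is not KTransformers code and uses none of its API: the function `place_experts`, the routing trace, and the fixed GPU slot count are all hypothetical, and a simple "most frequently activated experts go to GPU" rule is assumed purely for illustration.

```python
# Illustrative sketch only: NOT the KTransformers API.
# Assumption: "hot" means "most frequently routed to" in a sample trace.

def place_experts(activation_counts, gpu_slots):
    """Split experts into a GPU-resident (hot) set and a CPU-resident (cold) set.

    activation_counts: dict mapping expert id -> number of times it was routed to
    gpu_slots: how many experts fit in GPU memory (hypothetical budget)
    """
    # Rank by descending activation count; break ties by expert id for determinism.
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    hot = set(ranked[:gpu_slots])
    cold = set(ranked[gpu_slots:])
    return hot, cold


# Hypothetical routing trace: the expert id chosen for each of 10 tokens.
trace = [0, 3, 3, 1, 3, 0, 2, 3, 0, 5]
num_experts = 6
counts = {e: trace.count(e) for e in range(num_experts)}

hot, cold = place_experts(counts, gpu_slots=2)
print("GPU (hot):", sorted(hot))
print("CPU (cold):", sorted(cold))
```

A real system would also have to account for per-layer expert sets, memory per expert, and transfer costs; this sketch only shows the ranking idea.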
INSTALL
pip install ktransformers
INTEGRATION GUIDE
1. Fine-tune DeepSeek-V3/R1 on a single RTX 4090 using CPU-GPU heterogeneous MoE expert offloading
2. Run production inference on 200B+ MoE models locally by routing hot experts to GPU and cold experts to system RAM
3. Fine-tune large MoE models with LoRA/QLoRA via LLaMA-Factory integration using INT8/INT4 quantized CPU kernels
4. Serve concurrent inference requests for Kimi-K2, DeepSeek, and Qwen3-MoE on commodity hardware
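A rough back-of-the-envelope calculation shows why quantization and system RAM both matter for the use cases above. The helper `weight_gib` is a hypothetical name, and the estimate assumes weights only: it ignores quantization scales, activations, KV cache, and optimizer state.

```python
# Rough estimate of weight storage at different precisions.
# Assumption: weights only; scales/zero-points, activations and KV cache are ignored.

def weight_gib(num_params, bits_per_weight):
    """Return the approximate weight footprint in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


params = 100e9  # a 100B-parameter model, the lower bound named in the listing

for name, bits in [("16-bit", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_gib(params, bits):.1f} GiB")
```

Even at INT4, the weights of a 100B-parameter model exceed 24GB of VRAM under these assumptions, which is why the bulk of the experts has to live in system DRAM.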
TAGS
python, llm, fine-tuning, inference, moe, quantization, deepseek, consumer-gpu