Fine-tuning, Free, Open Source
FLASH ATTENTION
Fast and memory-efficient exact attention for transformers
BSD-3-Clause
ABOUT
Attention layers in transformer models are often bottlenecked by GPU memory bandwidth rather than by compute. Standard attention writes the full N × N attention matrix to high-bandwidth memory (HBM) and reads it back, consuming O(N²) memory for sequence length N and making long-context training slow and memory-intensive. Flash Attention computes exact attention in tiles that fit in on-chip SRAM and uses an online softmax, so the full N × N matrix is never materialized in HBM. This reduces the number of HBM reads and writes and brings attention memory down from quadratic to linear in N, so developers can train and fine-tune models on significantly longer sequences with lower memory usage and higher throughput. The output matches standard attention up to floating-point rounding; it is not an approximation.
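The tiling idea can be illustrated with a small CPU-only sketch. This is not the library's CUDA kernel: it is a pure-Python toy that handles one query row at a time and visits keys and values in blocks, keeping only a running maximum, a running softmax denominator, and a running weighted sum. The function names and the block_size parameter are hypothetical and chosen for illustration.

```python
import math
import random


def naive_attention(Q, K, V):
    """Reference softmax(Q K^T / sqrt(d)) V, scoring each query against all keys at once."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=4):
    """Same result, but K and V are visited block by block (illustrative sketch)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        m = -math.inf              # running max of the scores seen so far
        l = 0.0                    # running sum of exp(score - m)
        acc = [0.0] * len(V[0])    # running unnormalized weighted sum of V rows
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            m_new = max(m, max(scores))
            # Rescale the earlier statistics so they are relative to the new max.
            correction = math.exp(m - m_new)
            l *= correction
            acc = [a * correction for a in acc]
            for s, v in zip(scores, v_blk):
                w = math.exp(s - m_new)
                l += w
                acc = [a + w * x for a, x in zip(acc, v)]
            m = m_new
        out.append([a / l for a in acc])
    return out


def max_abs_diff(A, B):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


random.seed(0)
N, d = 10, 4  # toy sizes; N is deliberately not a multiple of block_size
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
print(max_abs_diff(naive_attention(Q, K, V), tiled_attention(Q, K, V)))
```

The printed difference should be at the level of floating-point rounding, which is the sense in which the tiled computation is exact. The real kernel also tiles the queries and runs the inner loops in SRAM on the GPU; this sketch only shows why the rescaling step makes block-wise processing give the same answer.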
INSTALL
pip install flash-attn
INTEGRATION GUIDE
1. Accelerate LLM pre-training by up to 2-4x with exact attention that reduces GPU memory bandwidth pressure
2. Fine-tune and run inference with up to 10x longer context windows by avoiding O(N²) memory growth in the attention layer
3. Train and deploy transformer models on consumer GPUs that would otherwise run out of memory with standard attention
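The O(N²) memory growth in item 2 can be made concrete with some rough arithmetic. The sketch below compares the bytes needed to store every attention score against the bytes needed for one softmax statistic per query row. The batch size, head count, fp16 scores, and fp32 per-row statistic are illustrative assumptions, and the helper names are hypothetical; real usage depends on the model and kernel.

```python
def score_matrix_bytes(batch, heads, seq_len, bytes_per_score=2):
    """Bytes to hold every N x N attention score (fp16 scores assumed)."""
    return batch * heads * seq_len * seq_len * bytes_per_score


def row_stats_bytes(batch, heads, seq_len, bytes_per_stat=4):
    """Bytes for one softmax statistic per query row (fp32 assumed)."""
    return batch * heads * seq_len * bytes_per_stat


GIB = 2 ** 30
MIB = 2 ** 20

# Assumed configuration: batch size 1, 32 attention heads.
for n in (8192, 16384):
    full = score_matrix_bytes(1, 32, n) / GIB
    stats = row_stats_bytes(1, 32, n) / MIB
    print(f"N={n}: full score matrix {full:.0f} GiB, per-row statistics {stats:.0f} MiB")
# N=8192: full score matrix 4 GiB, per-row statistics 1 MiB
# N=16384: full score matrix 16 GiB, per-row statistics 2 MiB
```

Doubling the sequence length quadruples the materialized score matrix but only doubles the per-row statistics, which is why avoiding the N × N matrix matters most at long context.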
TAGS
attention, transformer, training, inference, gpu, memory-efficient, deep-learning, llm, optimization