Fine-tuning · Free · Open Source

FLASH ATTENTION

Fast and memory-efficient exact attention for transformers

BSD-3-Clause

ABOUT

Attention layers in transformer models are bottlenecked by GPU memory bandwidth. Standard attention writes the full N×N attention matrix to HBM and reads it back, consuming O(N²) memory for sequence length N and making long-context training slow and memory-intensive. Flash Attention computes exact attention with far fewer HBM reads and writes: it tiles the computation into blocks that fit in on-chip GPU SRAM and never materializes the full N×N matrix, so attention memory grows linearly with sequence length rather than quadratically. This lets developers train and fine-tune models with significantly longer sequences, lower memory usage, and higher throughput, while the result remains exact attention rather than an approximation.
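The tiling idea can be illustrated in plain Python. The sketch below is not the library's CUDA kernel; it is a minimal, single-query illustration that assumes the standard online-softmax rescaling trick, and the names naive_attention, tiled_attention, and block_size are hypothetical. It processes keys and values one block at a time and checks that the result matches a reference that materializes every score first.

```python
import math

def naive_attention(q, keys, values):
    """Reference: compute every score for one query, then apply softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def tiled_attention(q, keys, values, block_size=2):
    """Same output, but keys/values are consumed one block at a time.

    Only a running max, a running softmax denominator, and an
    unnormalized output accumulator are kept between blocks.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Illustrative data: one query, five key/value pairs, head dimension 2.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [-0.7, 0.9], [0.4, 0.4], [2.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0], [3.0, -1.0]]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
max_diff = max(abs(a - b) for a, b in zip(reference, tiled))
```

Because each block only rescales the running statistics, the two results differ by floating-point rounding only, which is the sense in which tiled attention is exact.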

INSTALL

pip install flash-attn

INTEGRATION GUIDE

1. Accelerate LLM pre-training by 2-4x with exact attention that reduces GPU memory bandwidth pressure during training
2. Fine-tune and run inference with 10x longer context windows by avoiding O(N²) memory growth in the attention layer
3. Train and deploy transformer models on consumer GPUs that would otherwise run out of memory with standard attention
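To see why O(N²) memory growth matters for the use cases above, a rough back-of-the-envelope calculation helps. The sketch below counts only the attention score matrix that standard attention would materialize for one layer; the helper name attention_matrix_bytes and the defaults (32 heads, batch size 1, 2 bytes per element as in fp16/bf16) are illustrative assumptions, not figures from the project.

```python
def attention_matrix_bytes(seq_len, num_heads=32, batch_size=1, bytes_per_element=2):
    """Bytes needed to hold one layer's N x N attention scores.

    Defaults are assumptions for illustration: 32 heads, batch size 1,
    and 2 bytes per element (fp16 or bf16).
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element

for n in (2048, 8192, 32768):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"N={n:>6}: {gib:.2f} GiB per layer")
```

Under these assumptions the score matrix grows from 0.25 GiB at N=2048 to 64 GiB at N=32768, a 256x increase for a 16x longer sequence. This is the buffer a tiled implementation avoids allocating.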
