Introduction to Video Diffusion Models and Computational Challenges
Diffusion models have made impressive progress in generating high-quality, coherent videos, building on their success in image synthesis. However, handling the extra temporal dimension in videos significantly increases computational demands, especially since self-attention scales poorly with sequence length. This makes it difficult to train or run these models efficiently on long videos. Approaches like Sparse VideoGen use attention-head classification to accelerate inference, but they struggle with accuracy and generalization during training. Other methods replace softmax attention with linear alternatives, though these often require significant architectural changes. Interestingly, the way physical signals naturally lose energy over distance and time suggests new, more efficient modeling strategies.
Evolution of Attention Mechanisms in Video Synthesis
Early video models extended 2D architectures by incorporating temporal components, but newer approaches, such as DiT and Latte, enhance spatial-temporal modeling through advanced attention mechanisms. While 3D dense attention achieves state-of-the-art performance, its computational cost increases rapidly with video length, making the generation of long videos expensive. Techniques such as timestep distillation, quantization, and sparse attention help reduce this burden, but often overlook the unique structure of video data. Although alternatives like linear or hierarchical attention improve efficiency, they typically struggle to maintain detail or scale effectively in practice.
Introduction to Spatiotemporal Energy Decay and Radial Attention
Researchers from MIT, NVIDIA, Princeton, UC Berkeley, Stanford, and First Intelligence have identified a phenomenon in video diffusion models called Spatiotemporal Energy Decay, where attention scores between tokens decline as spatial or temporal distance increases, mirroring how signals naturally fade. Motivated by this, they proposed Radial Attention, a sparse attention mechanism with O(n log n) complexity. It uses a static attention mask in which tokens attend mostly to nearby ones, with the attention window shrinking as temporal distance grows. This enables pre-trained models to generate videos up to four times longer, reducing training costs by 4.4 times and inference time by 3.7 times, all while preserving video quality.
Sparse Attention Using Energy Decay Principles
Radial Attention is based on the insight that attention scores in video models decrease with increasing spatial and temporal distance, a phenomenon known as Spatiotemporal Energy Decay. Instead of attending to all tokens equally, Radial Attention strategically reduces computation where attention is weaker. It introduces a sparse attention mask that decays exponentially outward in both space and time, preserving only the most relevant interactions. This results in an O(n log n) complexity, making it significantly faster and more efficient than dense attention. Additionally, with minimal fine-tuning using LoRA adapters, pre-trained models can be adapted to generate much longer videos efficiently and effectively.
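To make the decaying-window idea concrete, here is a minimal numpy sketch of such a mask. It is a hypothetical simplification, not the paper's actual construction: tokens are laid out frame by frame, and the spatial attention window halves each time the temporal (frame) distance doubles, so distant frames keep only a thin band of interactions while nearby tokens attend densely.

```python
import numpy as np

def radial_mask(num_frames, tokens_per_frame, base_window):
    """Boolean attention mask where the spatial window halves each time
    the temporal (frame) distance doubles -- a simplified stand-in for
    the exponentially decaying mask described above."""
    n = num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        fi, si = divmod(i, tokens_per_frame)  # (frame, spatial position)
        for j in range(n):
            fj, sj = divmod(j, tokens_per_frame)
            dt = abs(fi - fj)
            if dt == 0:
                w = base_window  # full window within the same frame
            else:
                # halve the window each time dt doubles (never below 1)
                w = max(base_window >> (dt.bit_length() - 1), 1)
            mask[i, j] = abs(si - sj) < w
    return mask

# Illustrative sizes only -- real models use far larger token counts.
mask = radial_mask(num_frames=8, tokens_per_frame=16, base_window=8)
density = mask.mean()  # fraction of token pairs kept vs. dense attention
```

Because the per-row window sum behaves roughly like a harmonic series over frame distance, the number of retained interactions grows on the order of n log n rather than n², which is where the efficiency gain comes from.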
Evaluation Across Video Diffusion Models
Radial Attention is evaluated on three leading text-to-video diffusion models: Mochi 1, HunyuanVideo, and Wan2.1, demonstrating both speed and quality improvements. Compared to existing sparse attention baselines, such as SVG and PowerAttention, Radial Attention offers better perceptual quality and significant computational gains, including up to 3.7 times faster inference and 4.4 times lower training cost for extended videos. It scales efficiently to 4× longer video lengths and maintains compatibility with existing LoRAs, including style ones. Importantly, LoRA fine-tuning with Radial Attention outperforms full fine-tuning in some cases, demonstrating its effectiveness and resource efficiency for high-quality long-video generation.
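The LoRA fine-tuning mentioned above works by freezing the pre-trained weights and training only a low-rank update. A minimal numpy sketch of that idea follows; the dimensions and scaling are illustrative, not the actual models' configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8  # hidden size, LoRA rank, scaling (illustrative)

W = rng.standard_normal((d, d))          # frozen pre-trained projection
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # B starts at zero: no change at init

def lora_forward(x):
    # Frozen path plus low-rank update; only A and B (2*d*r parameters,
    # here 512 vs. 4096 for W) would be trained.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((1, d))
# At initialization the adapted output equals the frozen output,
# so fine-tuning starts exactly from the pre-trained behavior.
assert np.allclose(lora_forward(x), x @ W.T)
```

This is why the adaptation is cheap: the trainable parameter count scales with the rank r rather than with the full weight matrices, which matches the article's point about LoRA fine-tuning being far less costly than full fine-tuning.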

Conclusion: Scalable and Efficient Long Video Generation
In conclusion, Radial Attention is a sparse attention mechanism designed to handle long video generation in diffusion models efficiently. Inspired by the observed decline in attention scores with increasing spatial and temporal distance, a phenomenon the researchers term Spatiotemporal Energy Decay, Radial Attention mimics this natural decay to reduce computation. It utilizes a static attention pattern with exponentially shrinking windows, achieving up to 1.9 times faster performance and supporting videos up to 4 times longer. With lightweight LoRA-based fine-tuning, it significantly cuts training costs (by 4.4×) and inference time (by 3.7×), all while preserving video quality across multiple state-of-the-art diffusion models.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.