Introduction to Video Diffusion Models and Computational Challenges
Diffusion models have made impressive progress in generating high-quality, coherent videos, building on their success in image synthesis. However, handling the extra temporal dimension in videos significantly increases computational demands, especially since self-attention scales poorly with sequence length. This makes it difficult to train or run these models efficiently on long videos. Approaches like Sparse VideoGen use attention-head classification to accelerate inference, but they struggle with accuracy and generalization during training. Other methods replace softmax attention with linear alternatives, though these often require significant architectural changes. Interestingly, the way physical signals naturally lose energy over distance and time suggests new, more efficient modeling strategies.
Evolution of Attention Mechanisms in Video Synthesis
Early video models extended 2D architectures by incorporating temporal components, but newer approaches, such as DiT and Latte, enhance spatial-temporal modeling through advanced attention mechanisms. While 3D dense attention achieves state-of-the-art performance, its computational cost increases rapidly with video length, making the generation of long videos expensive. Techniques such as timestep distillation, quantization, and sparse attention help reduce this burden, but often overlook the unique structure of video data. Although alternatives like linear or hierarchical attention improve efficiency, they typically struggle to maintain detail or scale effectively in practice.
Introduction to Spatiotemporal Energy Decay and Radial Attention
Researchers from MIT, NVIDIA, Princeton, UC Berkeley, Stanford, and First Intelligence have identified a phenomenon in video diffusion models called Spatiotemporal Energy Decay, where attention scores between tokens decline as spatial or temporal distance increases, mirroring how signals naturally fade. Motivated by this, they proposed Radial Attention, a sparse attention mechanism with O(n log n) complexity. It uses a static attention mask in which tokens attend mostly to nearby ones, with the attention window shrinking as temporal distance grows. This enables pre-trained models to generate videos up to four times longer, reducing training costs by 4.4 times and inference time by 3.7 times, all while preserving video quality.
Sparse Attention Using Energy Decay Principles
Radial Attention is based on the insight that attention scores in video models decrease with increasing spatial and temporal distance, a phenomenon known as Spatiotemporal Energy Decay. Instead of attending to all tokens equally, Radial Attention strategically reduces computation where attention is weaker. It introduces a sparse attention mask that decays exponentially outward in both space and time, preserving only the most relevant interactions. This results in an O(n log n) complexity, making it significantly faster and more efficient than dense attention. Additionally, with minimal fine-tuning using LoRA adapters, pre-trained models can be adapted to generate much longer videos efficiently and effectively.
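To make the decaying-window idea concrete, here is a minimal numpy sketch of such a mask. It is a hypothetical simplification, not the paper's actual construction: tokens are laid out frame by frame, and the spatial attention window halves each time the temporal (frame) distance doubles, so distant frames keep only a thin band of interactions while nearby tokens attend densely.

```python
import numpy as np

def radial_mask(num_frames, tokens_per_frame, base_window):
    """Boolean attention mask where the spatial window halves each time
    the temporal (frame) distance doubles -- a simplified stand-in for
    the exponentially decaying mask described above."""
    n = num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        fi, si = divmod(i, tokens_per_frame)  # (frame, spatial position)
        for j in range(n):
            fj, sj = divmod(j, tokens_per_frame)
            dt = abs(fi - fj)
            if dt == 0:
                w = base_window  # full window within the same frame
            else:
                # halve the window each time dt doubles (never below 1)
                w = max(base_window >> (dt.bit_length() - 1), 1)
            mask[i, j] = abs(si - sj) < w
    return mask

# Illustrative sizes only -- real models use far larger token counts.
mask = radial_mask(num_frames=8, tokens_per_frame=16, base_window=8)
density = mask.mean()  # fraction of token pairs kept vs. dense attention
```

Because the per-row window sum behaves roughly like a harmonic series over frame distance, the number of retained interactions grows on the order of n log n rather than n², which is where the efficiency gain comes from.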
Evaluation Across Video Diffusion Models
Radial Attention is evaluated on three leading text-to-video diffusion models: Mochi 1, HunyuanVideo, and Wan2.1, demonstrating both speed and quality improvements. Compared to existing sparse attention baselines, such as SVG and PowerAttention, Radial Attention offers better perceptual quality and significant computational gains, including up to 3.7 times faster inference and 4.4 times lower training cost for extended videos. It scales efficiently to 4× longer video lengths and maintains compatibility with existing LoRAs, including style ones. Importantly, LoRA fine-tuning with Radial Attention outperforms full fine-tuning in some cases, demonstrating its effectiveness and resource efficiency for high-quality long-video generation.
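The LoRA fine-tuning mentioned above works by freezing the pre-trained weights and training only a low-rank update. A minimal numpy sketch of that idea follows; the dimensions and scaling are illustrative, not the actual models' configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8  # hidden size, LoRA rank, scaling (illustrative)

W = rng.standard_normal((d, d))          # frozen pre-trained projection
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # B starts at zero: no change at init

def lora_forward(x):
    # Frozen path plus low-rank update; only A and B (2*d*r parameters,
    # here 512 vs. 4096 for W) would be trained.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((1, d))
# At initialization the adapted output equals the frozen output,
# so fine-tuning starts exactly from the pre-trained behavior.
assert np.allclose(lora_forward(x), x @ W.T)
```

This is why the adaptation is cheap: the trainable parameter count scales with the rank r rather than with the full weight matrices, which matches the article's point about LoRA fine-tuning being far less costly than full fine-tuning.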

Conclusion: Scalable and Efficient Long Video Generation
In conclusion, Radial Attention is a sparse attention mechanism designed to handle long video generation in diffusion models efficiently. Inspired by the observed decline in attention scores with increasing spatial and temporal distance, a phenomenon the researchers term Spatiotemporal Energy Decay, Radial Attention mimics this natural decay to reduce computation. It utilizes a static attention pattern with exponentially shrinking windows, achieving up to 1.9 times faster performance and supporting videos up to 4 times longer. With lightweight LoRA-based fine-tuning, it significantly cuts training costs (by 4.4×) and inference time (by 3.7×), all while preserving video quality across multiple state-of-the-art diffusion models.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.