DELTA: A Novel AI Method that Efficiently (10x Faster) Tracks Every Pixel in 3D Space from Monocular Videos

Tracking dense 3D motion from monocular videos remains challenging, particularly when aiming for pixel-level precision over long sequences. Existing methods fall short of detailed 3D tracking because they often track only a sparse set of points, which lacks the detail needed for full-scene understanding. They also demand substantial computational power, making it difficult to handle long videos efficiently. Additionally, many of them struggle to maintain accuracy over extended sequences, as camera movement and object occlusion cause the model to lose track or introduce errors.

Current methods take several approaches to estimating motion in video sequences, each with its own strengths and limitations. Optical flow techniques provide dense pixel-wise tracking but lack robustness in complex scenes, especially when extended to long sequences. Scene flow generalizes optical flow to estimate dense 3D motion, using either RGB-D data or point clouds, but it remains difficult to apply efficiently over long sequences. Point tracking captures motion trajectories by following specific points, with recent advances incorporating spatial and temporal attention for smoother tracking. However, point-tracking methods still struggle to achieve dense tracking because of the high computational cost. Tracking-by-reconstruction methods use a deformation field to estimate motion, which makes them less practical for real-time applications.
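
To make the relationship between optical flow and scene flow concrete, here is a minimal sketch that lifts a 2D flow vector into a 3D motion vector using per-frame depth. It assumes a standard pinhole camera; the function names, intrinsics, and sample values are illustrative assumptions, not taken from any of the methods discussed.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift pixel (u, v) with a known depth to a 3D camera-space point (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def scene_flow(pixel_t0: Tuple[float, float], depth_t0: float,
               flow_2d: Tuple[float, float], depth_t1: float,
               intrinsics: Tuple[float, float, float, float]) -> Point3D:
    """3D displacement of one pixel between two frames.

    flow_2d is the pixel's 2D optical flow; depth_t1 is the depth sampled
    at the pixel's new location in the second frame.
    """
    fx, fy, cx, cy = intrinsics
    u0, v0 = pixel_t0
    u1, v1 = u0 + flow_2d[0], v0 + flow_2d[1]
    p0 = unproject(u0, v0, depth_t0, fx, fy, cx, cy)
    p1 = unproject(u1, v1, depth_t1, fx, fy, cx, cy)
    return (p1[0] - p0[0], p1[1] - p0[1], p1[2] - p0[2])


# Hypothetical camera: focal length 500 px, principal point (320, 240).
K = (500.0, 500.0, 320.0, 240.0)
# A pixel at the image center, 2 m away, moves 50 px to the right at constant depth.
motion = scene_flow((320.0, 240.0), 2.0, (50.0, 0.0), 2.0, K)
print(motion)
```

The same 50-pixel shift corresponds to a larger 3D motion for a point farther from the camera, which is why 2D flow alone does not determine 3D motion.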

A team of researchers from UMass Amherst, MIT-IBM Watson AI Lab, and Snap Inc. has proposed DELTA (Dense Efficient Long-range 3D Tracking for Any video), the first method designed to efficiently track every pixel in 3D space across long video sequences. DELTA first tracks at reduced resolution using spatio-temporal attention and then applies an attention-based upsampler to recover high-resolution accuracy. Key innovations include an upsampler for sharp motion boundaries, an efficient spatial attention architecture for dense tracking, and a log-depth representation that enhances tracking performance. DELTA achieves state-of-the-art results on the CVO and Kubric3D datasets, showing over 10% improvement in metrics like Average Jaccard (AJ) and Average Position Difference in 3D (APD3D), and performs competitively on 3D point tracking benchmarks such as TAP-Vid3D and LSFOdyssey. Unlike existing methods, DELTA delivers dense 3D tracking at scale, running over 8x faster than previous methods while achieving state-of-the-art accuracy.
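
The idea behind an attention-style upsampler can be sketched in a few lines: each high-resolution pixel takes a softmax-weighted blend of nearby low-resolution tracks instead of a fixed bilinear average. This is only an illustration of the general technique, not DELTA's actual module. In a real model the scores would be predicted by a network, whereas here they are hand-picked, and all names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def upsample_pixel(coarse_neighbors: Sequence[Sequence[float]],
                   scores: Sequence[float]) -> List[float]:
    """Blend the coarse motion vectors around one fine pixel with softmax weights.

    coarse_neighbors: motion vectors of nearby low-resolution tracks.
    scores: one attention score per neighbor (assumed given; learned in practice).
    """
    weights = softmax(scores)
    dims = len(coarse_neighbors[0])
    return [sum(w * n[d] for w, n in zip(weights, coarse_neighbors))
            for d in range(dims)]


# A fine pixel sits on a motion boundary between a static background track
# and a foreground track moving 4 px to the right.
neighbors = [(0.0, 0.0), (4.0, 0.0)]

blurred = upsample_pixel(neighbors, [0.0, 0.0])   # equal scores: plain average
sharp = upsample_pixel(neighbors, [0.0, 10.0])    # scores favor the foreground

print(blurred)
print(sharp)
```

With equal scores the pixel receives an averaged motion that belongs to neither surface, while strongly peaked scores let it follow the foreground almost exactly, which is how data-dependent weights can keep motion boundaries sharp.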

Experiments showed that DELTA excels in 3D tracking tasks, outperforming previous methods in both speed and accuracy. Trained on a Kubric dataset with over 5,600 videos, DELTA uses a loss function that combines 2D coordinate, depth, and visibility losses.
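
A combined objective of this kind can be sketched as a weighted sum of three terms. The article does not give the exact form of each term or the weights, so the choices below (L1 for coordinates and depth, binary cross-entropy for visibility, unit weights) are assumptions made for illustration only.

```python
import math
from typing import Sequence


def l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equally sized sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def bce(prob: float, label: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one predicted visibility probability."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def tracking_loss(pred_uv: Sequence[float], gt_uv: Sequence[float],
                  pred_depth: float, gt_depth: float,
                  pred_vis: float, gt_vis: float,
                  w_depth: float = 1.0, w_vis: float = 1.0) -> float:
    """Weighted sum of 2D coordinate, depth, and visibility terms.

    The per-term losses and the weights w_depth / w_vis are hypothetical.
    """
    loss_2d = l1(pred_uv, gt_uv)
    loss_depth = abs(pred_depth - gt_depth)
    loss_vis = bce(pred_vis, gt_vis)
    return loss_2d + w_depth * loss_depth + w_vis * loss_vis


# One tracked point in one frame: ground truth vs. a slightly wrong prediction.
good = tracking_loss((10.0, 20.0), (10.0, 20.0), 2.0, 2.0, 1.0, 1.0)
bad = tracking_loss((12.0, 23.0), (10.0, 20.0), 2.5, 2.0, 0.6, 1.0)
print(good, bad)
```

In practice such a loss would be averaged over all tracked points and frames in a batch, and the relative weights would be tuned so that no single term dominates training.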

In benchmarks, DELTA achieved top scores on CVO for long-range 2D tracking and on Kubric3D for dense 3D tracking, completing tasks much faster than other methods. DELTA’s design choices, including log-depth representation, spatial attention, and an attention-based upsampler, significantly enhance its accuracy and efficiency across diverse tracking scenarios.
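
One way to see why a log-depth representation can help is that a difference in log-depth measures relative rather than absolute change. The sketch below is a plain illustration of that property under assumed sample values; it is not DELTA's implementation, and the function names are hypothetical.

```python
import math


def to_log_depth(depth: float) -> float:
    """Map a positive depth to log space."""
    return math.log(depth)


def from_log_depth(log_depth: float) -> float:
    """Invert the mapping back to a depth value."""
    return math.exp(log_depth)


# The same 10% depth error on a near point (1 m) and a far point (50 m).
near_true, near_pred = 1.0, 1.1
far_true, far_pred = 50.0, 55.0

linear_near = abs(near_pred - near_true)
linear_far = abs(far_pred - far_true)
log_near = abs(to_log_depth(near_pred) - to_log_depth(near_true))
log_far = abs(to_log_depth(far_pred) - to_log_depth(far_true))

print("linear errors:", linear_near, linear_far)
print("log-depth errors:", log_near, log_far)
```

In linear depth the far point's error is fifty times larger, so distant regions would dominate a depth loss. In log-depth both errors are equal, so near and far points contribute on the same scale.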

In conclusion, DELTA is a highly efficient method for tracking every pixel across video frames, achieving state-of-the-art accuracy in dense 2D and 3D tracking with a faster runtime than existing methods. The model may struggle with points that remain occluded for extended periods and performs best on videos with fewer than several hundred frames. It shares these limitations with earlier methods because it relies on short temporal processing windows. Moreover, the method’s 3D tracking accuracy depends on the precision and temporal stability of the monocular depth estimation it uses. Anticipated improvements in monocular depth estimation research will likely enhance the method’s performance further.

Nazmi Syed is a consulting intern at MarktechPost and is pursuing a Bachelor of Science degree at the Indian Institute of Technology (IIT) Kharagpur. She has a deep passion for Data Science and actively explores the wide-ranging applications of artificial intelligence across various industries. Fascinated by technological advancements, Nazmi is committed to understanding and implementing cutting-edge innovations in real-world contexts.
