Editor’s note – This is the first article in a two-part series:
- Part 1 – CoTracker3: Enhanced Point Tracking with Less Data (this article) – a primer on point tracking and CoTracker3
- Part 2 – CoTracker3: A Point Tracker Using Real Videos – a hands-on tutorial on how to run inference with the model and parse the output into FiftyOne format
A new semi-supervised approach achieves state-of-the-art performance with a thousandfold reduction in real data requirements.
Introduction to Point Tracking
In computer vision, point tracking estimates the movement of specific points within a video over time. In this task, a model is trained to identify and maintain correspondences between tracked points across multiple frames in a video.
However, accurately tracking points in videos isn’t a trivial task.
First, you have to worry about occlusions. This is a massive pain, and arguably the primary challenge for point trackers: objects in a scene can obstruct the view of tracked points, leading to inaccurate tracking. Second, a tracked point can leave the field of view. As objects and the camera move, tracked points move with them, often out of the camera’s field of view, which makes it difficult for some trackers to predict their trajectories. Third, you need to consider the appearance of points. Lighting and perspective changes can affect how a point looks, rapid movements can blur it, and points may appear larger or smaller as objects move closer to or farther from the camera. Last but not least is computational complexity, which becomes a bottleneck when tracking many points simultaneously, especially when modeling dependencies between tracks.
The Applications of Point Tracking That Make It a Problem Worth Tackling
Despite what some folks may think, computer vision is not a solved problem, and point tracking has various applications that push progress in other areas of computer vision. For example:
- Motion Analysis: Understanding motion provides insights into object movement and scene dynamics, which is essential for action recognition and tracking.
- 3D Reconstruction: You can infer 3D information by tracking points across multiple scene views.
- Video Editing and Special Effects: You can use point tracking to stabilize shaky footage, insert objects seamlessly, and apply effects that track specific points of interest.
- Robotics and Autonomous Navigation: This scenario requires real-time tracking to perceive and interact with the environment, enabling tasks like obstacle avoidance and object manipulation.
A Brief History of Point Tracking
Historically, there have been a few dominant paradigms for this task.
Optical Flow
Optical flow models estimate dense instantaneous motion. Approaches range from classical techniques to deep learning and transformer-based methods.
- Classical Approaches: Traditional optical flow methods relied on brightness constancy equations and often combined local and global flow estimations.
- Deep Learning-Based Methods: FlowNet and DCFlow used convolutional neural networks (CNNs) for this task.
- RAFT and its Variants: RAFT was novel as it used incremental flow updates and 4D cost volumes. This model inspired several follow-ups that further improved accuracy and efficiency.
- Transformer-Based Methods: Of course, in recent years, Transformers have been all the rage. For example, FlowFormer tokenizes the 4D cost volume, and GMFlow utilizes a softmax with self-attention for refinement. There’s also Perceiver IO, which proposed a unified transformer architecture for various tasks, including optical flow.
Optical flow methods are powerful for estimating dense instantaneous motion but are not ideal for long-term point tracking due to error accumulation.
Multi-Frame Optical Flow
Multi-frame optical flow models extend optical flow to multiple frames but are still not designed for long-term tracking or occlusion handling. While initial attempts to extend optical flow to multiple frames relied on Kalman filtering for temporal consistency, more recent approaches include:
- Modern Dense Flow Methods: Recent multi-frame optical flow models generate dense fields. RAFT can be adapted for multi-frame estimation through a warm-start approach. VideoFlow explicitly integrates forward and backward motion features across three to five consecutive frames to refine flow estimates.
- Multi-Flow Dense Tracker (MFT): MFT estimates flow between distant frames and selects the most reliable chain of optical flows to ensure consistent tracking.
These methods can estimate flow across multiple frames, but they are not designed for long-term point tracking and struggle with occlusions, especially when points stay hidden for a long time.
Point Tracking
Point tracking models track sparse sets of points over time. Unlike optical flow, which estimates dense instantaneous motion, point tracking focuses on a sparse set of points and tries to maintain their correspondence across multiple frames. Some models that follow this paradigm include:
- Particle Video: Particle Video pioneered the Tracking Any Point (TAP) concept but was limited in handling occlusions.
- PIPs: PIPs, building upon Particle Video, introduced improvements to track points through occlusions more reliably. This model utilizes a sliding window approach and restarts tracks from the last visible frame of a point.
- TAP-Vid: TAP-Vid introduced a new benchmark and a simple baseline for TAP, and if there’s anything new benchmarks do, it’s pushing the field forward!
- TAPIR: Combining concepts from TAP-Vid and PIPs, TAPIR is a two-stage, feed-forward point tracker that significantly improves tracking performance, especially in occlusion handling.
- PIPs++: Addressing long-term tracking, PIPs++, a simplified version of PIPs, was introduced alongside a benchmark for long-term tracking.
- OmniMotion: OmniMotion optimizes a volumetric video representation and refines correspondences in a canonical space. It often achieves high accuracy but requires computationally expensive test-time optimization.
Many point tracking models, like PIPs and PIPs++, track points independently; however, points in a video often exhibit strong dependencies, such as belonging to the same object. The original CoTracker model uses these dependencies by performing joint tracking of many points.
CoTracker is a transformer-based model that tracks many 2D points in extended video sequences. Unlike previous methods that typically track points independently, CoTracker introduces the concept of joint tracking. This means paying attention to dependencies between tracked points, leading to enhanced accuracy and robustness, mainly when dealing with occlusions or points moving out of the camera view.
Here are the new techniques that CoTracker developed to improve on previous methods:
- CoTracker uses a transformer architecture with an attention mechanism to share information between tracked points, giving it a better understanding of scene motion and helping it predict occluded points. This process, known as joint point tracking, differs from methods that track points independently, and CoTracker is one of the few models that uses deep networks for it (a minimal sketch of the cross-track attention idea appears after this list).
- Support Points improve tracking accuracy, especially when tracking only a single point or a few points. Support points are extra points the model tracks alongside the user-requested ones purely to give itself additional context; they are not requested by the user or passed as an argument. Different configurations include “global” (a regular grid across the image) and “local” (a grid centered around the target point). Experiments show that combining global and local support points gives the best performance.
- CoTracker introduces proxy tokens to address the computational cost associated with attention when dealing with many tracks. The architecture represents tracks using a grid of tokens, with each token encoding a specific track’s position, visibility, appearance, and correlation features at a given time. These tokens efficiently represent a subset of tracks, reducing memory complexity and enabling the model to jointly track a near-dense set of points on a single GPU during inference. This approach is like using registers to reduce memory complexity in other transformer architectures.
- While operating online with a sliding window approach, CoTracker uses unrolled training (borrowing the concept from recurrent networks). This method optimizes a network by unrolling its application over multiple overlapping windows during training. This improves long-term tracking capabilities, especially for occluded points. The windowed approach can process videos of arbitrary length by initializing subsequent windows with information from preceding ones, mimicking a recurrent network.
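To make the idea of joint tracking a bit more concrete, here is a minimal, hypothetical sketch of cross-track attention in PyTorch. This is not CoTracker’s actual implementation; the token shape, dimensions, and the `CrossTrackAttention` name are assumptions chosen for illustration. The point is simply that every track token can attend to every other track at the same time step.

```python
import torch
import torch.nn as nn

class CrossTrackAttention(nn.Module):
    """Toy cross-track attention: every track attends to every other track,
    so tracks can share context about overall scene motion."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, track_tokens: torch.Tensor) -> torch.Tensor:
        # track_tokens: (batch, time, num_tracks, dim) -- one token per track per frame
        b, t, n, d = track_tokens.shape
        x = track_tokens.reshape(b * t, n, d)   # fold time into the batch dimension
        attended, _ = self.attn(x, x, x)        # tracks exchange information
        x = self.norm(x + attended)             # residual connection + layer norm
        return x.reshape(b, t, n, d)

# 64 tracks over an 8-frame window, with a 128-dim token per track per frame
tokens = torch.randn(1, 8, 64, 128)
print(CrossTrackAttention()(tokens).shape)  # torch.Size([1, 8, 64, 128])
```

Tracking points independently would simply skip this step; sharing information across tracks is what lets the model reason about points that belong to the same rigid object, even when some of them are occluded.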
However, despite its impressive performance, CoTracker has some limitations:
- CoTracker is trained on synthetic data, which makes generalizing to complex real-world scenes – with elements like reflections and shadows – challenging.
- The model is sensitive to discontinuous videos, and performance might degrade in videos with multiple shots or discontinuous segments, as it is primarily designed for continuous video sequences.
Introducing CoTracker3
Building upon the foundation laid by CoTracker, CoTracker3 has a simpler architecture, improved data efficiency, and greater flexibility.
While maintaining the core concept of joint point tracking, CoTracker3 refines and streamlines various aspects, achieving state-of-the-art results with significantly less training data. Here are some of the improvements and innovations introduced by CoTracker3 that I feel are interesting:
Architectural Simplifications and Enhancements
- CoTracker3 uses a 4D correlation feature representation that captures spatial relationships between features in different frames (introduced in the LocoTrack paper). However, it simplifies the processing of these features by employing a straightforward multi-layer perceptron (MLP) instead of LocoTrack’s more complex ad-hoc module, which reduces computational overhead while maintaining representational power (see the toy sketch after this list).
- In CoTracker, visibility flags – indicating whether a point is visible or occluded – were updated by a separate network. CoTracker3 integrates this process directly into the main transformer, updating visibility flags alongside other track attributes at each iteration. This simplifies the architecture and improves efficiency.
- CoTracker3 has online and offline versions with the same architecture but different training procedures. The online version operates in a sliding window for real-time tracking, while the offline version processes the entire video for improved bi-directional tracking and handling of occlusions.
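As a rough illustration of the 4D correlation idea and the MLP simplification, here is a toy sketch. The patch size, channel count, and the `local_4d_correlation` helper are assumptions made for illustration; they do not reproduce LocoTrack’s module or CoTracker3’s real code.

```python
import torch
import torch.nn as nn

def local_4d_correlation(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Correlate every cell of a patch around the query point (feat_a) with every
    cell of a patch around the current track estimate (feat_b).
    Shapes: (batch, patch*patch, channels) -> (batch, patch*patch, patch*patch)."""
    return torch.einsum("bic,bjc->bij", feat_a, feat_b) / feat_a.shape[-1] ** 0.5

# Hypothetical sizes: a 7x7 feature patch with 128 channels
patch, channels = 7, 128
feat_a = torch.randn(2, patch * patch, channels)
feat_b = torch.randn(2, patch * patch, channels)

corr = local_4d_correlation(feat_a, feat_b)   # (2, 49, 49): the "4D" correlation volume

# The CoTracker3-style simplification: flatten the correlation and feed it to a plain MLP
corr_mlp = nn.Sequential(
    nn.Linear(patch ** 4, 256),
    nn.GELU(),
    nn.Linear(256, 128),
)
corr_features = corr_mlp(corr.flatten(1))     # (2, 128) compact correlation features
print(corr_features.shape)
```

The takeaway is that a generic MLP over the flattened correlation volume can stand in for a hand-designed processing module, which is what keeps the architecture lean.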
Semi-Supervised Training Pipeline
- CoTracker3 adopts a semi-supervised training strategy that relies on real-world videos without manual annotations. It uses multiple existing point trackers trained on synthetic data to generate pseudo-labels for real videos and then trains a student model on this larger, pseudo-labeled dataset.
- This approach is data efficient. It outperforms BootsTAPIR—a state-of-the-art tracker trained on 15 million real videos—using only 15,000 real videos, a whopping thousandfold decrease in data requirements.
- CoTracker3 uses SIFT feature detection to guide the selection of query points for pseudo-labeling, prioritizing points deemed “good to track.” This improves pseudo-labeled data quality and training stability.
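Here is a small sketch of what SIFT-guided query selection could look like with OpenCV. The function name, the cap of 256 points, and the exact selection criterion are illustrative assumptions rather than the paper’s exact procedure; the (frame_index, x, y) layout follows the query convention commonly used by CoTracker-style models.

```python
import cv2
import numpy as np
import torch

def sift_query_points(frame_bgr: np.ndarray, max_points: int = 256) -> torch.Tensor:
    """Pick 'good to track' query points on a frame using SIFT keypoints,
    roughly in the spirit of CoTracker3's pseudo-labeling query selection."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    keypoints = cv2.SIFT_create().detect(gray, None)
    # Keep only the strongest detections
    keypoints = sorted(keypoints, key=lambda k: k.response, reverse=True)[:max_points]
    # One query per keypoint as (frame_index, x, y); here all queries start at frame 0
    return torch.tensor([[0.0, k.pt[0], k.pt[1]] for k in keypoints])

# Hypothetical usage on the first frame of a video:
# frame = cv2.imread("first_frame.png")
# queries = sift_query_points(frame)   # shape: (num_points, 3)
```

Anchoring queries on distinctive keypoints rather than arbitrary pixels makes the teacher trackers more likely to agree, which is what keeps the pseudo-labels clean.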
Performance Improvements
- CoTracker3 consistently outperforms existing trackers on various benchmarks, including TAP-Vid, Dynamic Replica, and RoboTAP. It does an excellent job tracking occluded points, especially in offline mode. This is thanks to its joint tracking capability and access to the entire video sequence.
- CoTracker3’s architectural simplifications result in a leaner model that runs 27% faster than LocoTrack, the previous fastest point tracker, despite incorporating cross-track attention for joint tracking.
- Experiments with increasing amounts of pseudo-labeled data demonstrate that CoTracker3 continues to improve with more real-world data.
- CoTracker3 also benefits from self-training, where it is fine-tuned on its own predictions on real videos. This bridges the gap between synthetic training data and real-world scenarios.
Limitations
Despite its advancements, CoTracker3, like its predecessor, has its limitations:
- The pseudo-labeling pipeline’s performance depends on the quality and diversity of the chosen teacher models. As the student model approaches the performance of its teachers, its ability to improve will plateau, necessitating the introduction of stronger or more diverse teachers for continued advancement.
- Pseudo-labeling also carries a risk of overfitting to specific domains, so the pseudo-labeled data must remain diverse and representative of the target scenarios.
CoTracker and CoTracker3 are significant point-tracking advancements, especially when dealing with long video sequences and challenges like occlusions.
CoTracker introduced the innovative concept of joint tracking through a transformer network, achieving state-of-the-art performance but relying heavily on synthetic training data. CoTracker3 builds upon this foundation, introducing architectural simplifications, a novel semi-supervised training pipeline, and improved efficiency. By leveraging multiple pre-trained trackers as teachers to generate pseudo-labels for real-world videos, CoTracker3 significantly reduces the dependency on synthetic data. It achieves even better accuracy with a thousandfold reduction in real data requirements.
Both models highlight the power of considering dependencies between tracked points and utilizing context to enhance tracking accuracy.
Next Steps
Now that you’re up to speed on the task of point tracking and the CoTracker3 model, it’s time to get hands-on with some code!
Check out this blog where you’ll learn about:
- Online vs. Offline Modes: Distinguish between CoTracker3’s online (real-time, forward-only) and offline (bidirectional, better accuracy but memory-intensive) modes.
- Running Inference: Learn how to download a pre-trained CoTracker3 model and run inference on video data using PyTorch (a quick teaser sketch follows this list).
- Visualizing Results: See how to visualize the tracked points and their visibility over time.
- FiftyOne Integration: Understand how to parse and integrate CoTracker3’s output into FiftyOne, a powerful dataset visualization and analysis tool, for exploring and interacting with the tracking results.
- Memory Management: Learn practical tips for managing GPU memory when working with large video files, including pre-processing techniques like frame sampling and rate reduction.
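As a quick teaser before Part 2, the sketch below shows roughly what inference looks like via torch.hub. Treat the entry-point name and call signature as assumptions based on the facebookresearch/co-tracker repository at the time of writing, and check the repo’s README for the current names; Part 2 walks through the real workflow in detail.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Entry-point name assumed from the facebookresearch/co-tracker repo; an online
# variant is typically available alongside the offline one.
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline").to(device)

# Stand-in clip: a (batch, frames, channels, height, width) float tensor
video = torch.randn(1, 24, 3, 256, 256, device=device)

# grid_size places a regular grid of query points on the first frame
pred_tracks, pred_visibility = cotracker(video, grid_size=10)

print(pred_tracks.shape)      # (batch, frames, num_points, 2) x/y coordinates
print(pred_visibility.shape)  # (batch, frames, num_points) per-point visibility
```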