
CoTracker3: Enhanced Point Tracking with Less Data

A new semi-supervised approach achieves state-of-the-art performance with a thousandfold reduction in real data requirements.

Introduction to Point Tracking

In computer vision, point tracking estimates the movement of specific points within a video over time. A model is trained to identify tracked points and maintain their correspondences across multiple frames.

However, accurately tracking points in videos isn’t a trivial task. 

  • Occlusions: This is a massive pain, and I’d put it as the primary challenge for point trackers. Objects in a scene can obstruct the view of tracked points, leading to inaccurate tracking.
  • Points leaving the field of view: As objects and the camera move, tracked points move with them, often out of the camera’s field of view. This makes it difficult for some trackers to predict their trajectories.
  • Appearance changes: Lighting and perspective changes can alter how a point looks, rapid movements can blur it, and points may appear larger or smaller as objects move closer to or farther from the camera.
  • Computational complexity: Tracking many points simultaneously becomes a bottleneck, especially once you account for dependencies between tracks.

The Applications of Point Tracking That Make it a Problem Worth Tackling

Despite what some folks may think, computer vision is not a solved problem, and the point tracking task has various applications that push progress in other areas of computer vision. For example:

  • Motion Analysis: Understanding motion provides insights into object movement and scene dynamics, which is essential for action recognition and tracking.
  • 3D Reconstruction: You can infer 3D information by tracking points across multiple scene views. 
  • Video Editing and Special Effects: You can use point tracking to stabilize shaky footage, insert objects seamlessly, and apply effects that track specific points of interest.
  • Robotics and Autonomous Navigation: This scenario requires real-time tracking to perceive and interact with the environment, enabling tasks like obstacle avoidance and object manipulation.

A Brief History of Point Tracking

Historically, there have been a few dominant paradigms for this task.

Optical Flow

Optical flow models estimate dense instantaneous motion. They can be classical approaches, deep learning methods, or transformer-based methods.

  • Classical Approaches: Traditional optical flow methods relied on brightness constancy equations and often combined local and global flow estimations (the basic constraint is written out just after this list).
  • Deep Learning-Based Methods: FlowNet and DCFlow used convolutional neural networks (CNNs) for this task.
  • RAFT and its Variants: RAFT was novel as it used incremental flow updates and 4D cost volumes. This model inspired several follow-ups that further improved accuracy and efficiency.
  • Transformer-Based Methods: Of course, in recent years, Transformers have been all the rage. For example, FlowFormer tokenizes the 4D cost volume, and GMFlow utilizes a softmax with self-attention for refinement. There’s also Perceiver IO, which proposed a unified transformer architecture for various tasks, including optical flow.
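
For reference, here is the brightness constancy constraint mentioned above. This is a standard textbook result rather than anything specific to the models in this post:

```latex
% Assume a pixel's intensity is preserved as it moves with velocity (u, v):
I(x + u\,\Delta t,\; y + v\,\Delta t,\; t + \Delta t) = I(x, y, t)
% A first-order Taylor expansion gives the optical flow constraint:
I_x\,u + I_y\,v + I_t = 0
% One equation with two unknowns per pixel, which is why classical methods add
% local (Lucas-Kanade) or global (Horn-Schunck) smoothness assumptions.
```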

Optical flow methods are powerful for estimating dense instantaneous motion but are not ideal for long-term point tracking due to error accumulation. 

Multi-Frame Optical Flow

Multi-frame optical flow models extend optical flow to multiple frames but are still not designed for long-term tracking or occlusion handling.  While initial attempts to extend optical flow to multiple frames relied on Kalman filtering for temporal consistency, more recent approaches include:

  • Modern Dense Flow Methods: Recent multi-frame optical flow models generate dense fields. RAFT can be adapted for multi-frame estimation through a warm-start approach. VideoFlow explicitly integrates forward and backward motion features across three to five consecutive frames to refine flow estimates.
  • Multi-Flow Dense Tracker (MFT): MFT estimates flow between distant frames and selects the most reliable chain of optical flows to ensure consistent tracking.

These methods can estimate flow across multiple frames, but they are not designed for long-term point tracking and struggle with occlusions, especially when points are occluded for extended periods.

Point Tracking

Point tracking models track sparse sets of points over time.  Unlike optical flow, which estimates dense instantaneous motion, point tracking focuses on a sparse set of points and tries to maintain their correspondence across multiple frames. Some models that follow this paradigm include:

  • Particle Video: Particle Video pioneered the Tracking Any Point (TAP) concept but was limited in handling occlusions.
  • PIPs: PIPs, building upon Particle Video, introduced improvements to track points through occlusions more reliably. This model utilizes a sliding window approach and restarts tracks from the last visible frame of a point.
  • TAP-Vid: TAP-Vid introduced a new benchmark and a simple baseline for TAP, and if there’s anything new benchmarks do, it’s pushing the field forward!
  • TAPIR: Combining concepts from TAP-Vid and PIPs, TAPIR is a two-stage, feed-forward point tracker that significantly improves tracking performance, especially in occlusion handling.
  • PIPs++: Addressing long-term tracking, PIPs++, a simplified version of PIPs, was introduced alongside a benchmark for long-term tracking.
  • OmniMotion: OmniMotion optimizes a volumetric video representation and refines correspondences in a canonical space. It often achieves high accuracy but requires computationally expensive test-time optimization.

Many point tracking models, like PIPs and PIPs++, track points independently; however, points in a video often exhibit strong dependencies, such as belonging to the same object. The original CoTracker model uses these dependencies by performing joint tracking of many points. 

CoTracker is a transformer-based model that tracks many 2D points in extended video sequences. Unlike previous methods that typically track points independently, CoTracker introduces the concept of joint tracking. This means paying attention to dependencies between tracked points, leading to enhanced accuracy and robustness, mainly when dealing with occlusions or points moving out of the camera view.

Here are the new techniques that CoTracker developed to improve on previous methods:

  • CoTracker uses a transformer architecture with an attention mechanism to share information between tracked points to better understand scene motion and predict occluded points. This process, known as joint point tracking, differs from methods that track points independently. CoTracker is one of the few that uses deep networks for joint tracking.
  • Support Points improve tracking accuracy, especially when tracking only a single point or a few points. Support points are additional points the model tracks alongside the ones the user actually asked for; even though they aren’t explicitly requested or passed as an argument, they give the model extra context. Different configurations include “global” (a regular grid across the image) and “local” (a grid centered around the target point). Experiments show that combining global and local support points gives the best performance.
  • CoTracker introduces proxy tokens to address the computational cost associated with attention when dealing with many tracks. The architecture represents tracks using a grid of tokens, with each token encoding a specific track’s position, visibility, appearance, and correlation features at a given time. These tokens efficiently represent a subset of tracks, reducing memory complexity and enabling the model to jointly track a near-dense set of points on a single GPU during inference. This approach is like using registers to reduce memory complexity in other transformer architectures (a minimal sketch of the proxy-token idea follows this list).
  • While operating online with a sliding window approach, CoTracker uses unrolled training (borrowing the concept from recurrent networks). This method optimizes a network by unrolling its application over multiple overlapping windows during training. This improves long-term tracking capabilities, especially for occluded points. The windowed approach can process videos of arbitrary length by initializing subsequent windows with information from preceding ones, mimicking a recurrent network.
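
To make the proxy-token idea above a bit more concrete, here is a minimal PyTorch sketch of cross-track attention routed through a small set of learned proxy tokens. This is an illustration of the concept rather than CoTracker’s actual module; the class name, token dimensions, and the use of nn.MultiheadAttention are my own assumptions.

```python
import torch
import torch.nn as nn

class ProxyCrossTrackAttention(nn.Module):
    """Illustrative sketch: instead of letting all N track tokens attend to each
    other (O(N^2)), tracks exchange information through K learned proxy tokens
    (O(N*K)). Not CoTracker's actual implementation."""

    def __init__(self, dim: int = 256, num_proxies: int = 64, num_heads: int = 8):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_proxies, dim) * 0.02)
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, track_tokens: torch.Tensor) -> torch.Tensor:
        # track_tokens: (B, N, dim), one token per track at a given time step.
        B = track_tokens.shape[0]
        proxies = self.proxies.unsqueeze(0).expand(B, -1, -1)
        # 1) Proxy tokens gather information from every track token.
        summary, _ = self.gather(proxies, track_tokens, track_tokens)
        # 2) Each track token reads the shared summary back from the proxies.
        update, _ = self.scatter(track_tokens, summary, summary)
        return track_tokens + update

# Quick shape check: 2 videos, 900 tracks, 256-dim tokens.
tokens = torch.randn(2, 900, 256)
print(ProxyCrossTrackAttention()(tokens).shape)  # torch.Size([2, 900, 256])
```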

However, despite its impressive performance, CoTracker has some limitations:

  • CoTracker is trained on synthetic data, which makes generalizing to complex real-world scenes – with elements like reflections and shadows – challenging.
  • The model is sensitive to discontinuous videos, and performance might degrade in videos with multiple shots or discontinuous segments, as it is primarily designed for continuous video sequences.

Introducing CoTracker3

Building upon the foundation laid by CoTracker, CoTracker3 offers a simpler architecture, improved data efficiency, and greater flexibility.

While maintaining the core concept of joint point tracking, CoTracker3 refines and streamlines various aspects, achieving state-of-the-art results with significantly less training data. Here are some of the improvements and innovations introduced by CoTracker3 that I feel are interesting:

Architectural Simplifications and Enhancements

  • CoTracker3 uses a 4D correlation feature representation that captures spatial relationships between features in different frames (introduced in the LocoTrack paper). However, it simplifies the processing of these features by employing a straightforward multi-layer perceptron (MLP) instead of LocoTrack’s more complex ad-hoc module, reducing computational overhead while maintaining representational power (a rough sketch of the idea follows this list).
  • In CoTracker, visibility flags – indicating whether a point is visible or occluded – were updated by a separate network. CoTracker3 integrates this process directly into the main transformer, updating visibility flags alongside other track attributes at each iteration. This simplifies the architecture and improves efficiency.
  • CoTracker3 has online and offline versions with the same architecture but different training procedures. The online version operates in a sliding window for real-time tracking, while the offline version processes the entire video for improved bi-directional tracking and handling of occlusions.
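
To give a rough feel for what “4D correlation features processed by an MLP” means, here is a simplified sketch. The patch and window sizes, function names, and MLP shape are assumptions for illustration only; the actual LocoTrack/CoTracker3 implementations are multi-scale and more involved.

```python
import torch
import torch.nn as nn

def local_4d_correlation(query_patches, search_windows):
    """Correlate a p x p feature patch around each tracked point with a k x k
    search window in the target frame, giving a (p, p, k, k) volume per point.
    Shapes: query_patches (N, C, p, p), search_windows (N, C, k, k)."""
    N, C, p, _ = query_patches.shape
    k = search_windows.shape[-1]
    q = query_patches.reshape(N, C, p * p)
    s = search_windows.reshape(N, C, k * k)
    corr = torch.einsum("ncp,ncs->nps", q, s) / C ** 0.5  # (N, p*p, k*k)
    return corr.reshape(N, p, p, k, k)

class CorrelationMLP(nn.Module):
    """A plain MLP over the flattened 4D volume, standing in for CoTracker3's
    replacement of LocoTrack's ad-hoc correlation module (sizes are made up)."""

    def __init__(self, p: int = 3, k: int = 7, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(p * p * k * k, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, corr):              # corr: (N, p, p, k, k)
        return self.net(corr.flatten(1))  # (N, dim), fed to the transformer

# Shape check with made-up sizes: 128 points, 64-dim features.
corr = local_4d_correlation(torch.randn(128, 64, 3, 3), torch.randn(128, 64, 7, 7))
print(CorrelationMLP()(corr).shape)  # torch.Size([128, 256])
```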

Semi-Supervised Training Pipeline

  • CoTracker3 adopts a semi-supervised training strategy built on real-world videos without manual annotations. It uses multiple existing point trackers trained on synthetic data to generate pseudo-labels for real videos and then trains a student model on this larger, pseudo-labeled dataset.
  • This approach is data efficient. It outperforms BootsTAPIR—a state-of-the-art tracker trained on 15 million real videos—using only 15,000 real videos, a whopping thousandfold decrease in data requirements.
  • CoTracker3 uses SIFT feature detection to guide the selection of query points for pseudo-labeling, prioritizing points deemed “good to track.” This improves pseudo-labeled data quality and training stability.
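
As an illustration of that last point, here is a small sketch of how SIFT-based query selection could look with OpenCV. This is my own approximation of the idea, not the paper’s exact pipeline; the function name and max_points parameter are assumptions.

```python
import cv2
import numpy as np

def sift_query_points(frame_bgr: np.ndarray, max_points: int = 256) -> np.ndarray:
    """Pick 'good to track' query points on a frame by keeping the strongest
    SIFT keypoints. Returns an (N, 2) array of (x, y) pixel coordinates that
    could then be handed to the teacher trackers for pseudo-labeling."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create(nfeatures=max_points)
    keypoints = sift.detect(gray, None)
    # Keep the strongest responses first, capped at max_points.
    keypoints = sorted(keypoints, key=lambda kp: kp.response, reverse=True)[:max_points]
    return np.array([kp.pt for kp in keypoints], dtype=np.float32)

# Example usage on the first frame of a video:
# queries = sift_query_points(cv2.imread("first_frame.png"))
```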

Performance Improvements

  • CoTracker3 consistently outperforms existing trackers on various benchmarks, including TAP-Vid, Dynamic Replica, and RoboTAP. It does an excellent job tracking occluded points, especially in offline mode. This is thanks to its joint tracking capability and access to the entire video sequence.
  • CoTracker3’s architectural simplifications result in a leaner model that runs 27% faster than LocoTrack, the previous fastest point tracker, despite incorporating cross-track attention for joint tracking.
  • Experiments with increasing amounts of pseudo-labeled data demonstrate that CoTracker3 continues to improve with more real-world data.
  • CoTracker3 also benefits from self-training, where it is fine-tuned on its own predictions on real videos. This bridges the gap between synthetic training data and real-world scenarios.

Limitations

Despite its advancements, CoTracker3, like its predecessor, has its limitations:

  • The pseudo-labeling pipeline’s performance depends on the quality and diversity of the chosen teacher models. As the student model approaches the performance of its teachers, its ability to improve will plateau, necessitating the introduction of stronger or more diverse teachers for continued advancement.
  • Pseudo-labeling also carries a risk of overfitting to specific domains, so the pseudo-labeled training data needs to be diverse and representative.

CoTracker and CoTracker3 are significant advancements in point tracking, particularly for long video sequences and challenges like occlusions.

CoTracker introduced the innovative concept of joint tracking through a transformer network, achieving state-of-the-art performance but relying heavily on synthetic training data.  CoTracker3 builds upon this foundation, introducing architectural simplifications, a novel semi-supervised training pipeline, and improved efficiency. By leveraging multiple pre-trained trackers as teachers to generate pseudo-labels for real-world videos, CoTracker3 significantly reduces the dependency on synthetic data. It achieves even better accuracy with a thousandfold reduction in real data requirements.  

Both models highlight the power of considering dependencies between tracked points and utilizing context to enhance tracking accuracy. 

Next Steps

Now that you’re up to speed on the task of point tracking and the CoTracker3 model, it’s time to get hands-on with some code!

Check out this blog where you’ll learn about:

  • Online vs. Offline Modes: Distinguish between CoTracker3’s online (real-time, forward-only) and offline (bidirectional, better accuracy but memory-intensive) modes.
  • Running Inference: Learn how to download a pre-trained CoTracker3 model and run inference on video data using PyTorch (a quick sketch follows this list).
  • Visualizing Results: See how to visualize the tracked points and their visibility over time.
  • FiftyOne Integration: Understand how to parse and integrate CoTracker3’s output into FiftyOne, a powerful dataset visualization and analysis tool, for exploring and interacting with the tracking results.
  • Memory Management: Learn practical tips for managing GPU memory when working with large video files, including pre-processing techniques like frame sampling and rate reduction.
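
If you want a quick taste before diving into that post, the snippet below shows roughly what offline inference looks like. It assumes the torch.hub entry point published in the facebookresearch/co-tracker repository (cotracker3_offline) and a video tensor you have already loaded; check the official repo for the exact, current API.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the offline CoTracker3 model from torch.hub (entry-point name assumed
# from the facebookresearch/co-tracker repo; verify against its README).
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline").to(device)

# video: float tensor of shape (B, T, C, H, W). A random tensor stands in here
# for a real clip you would load with imageio/torchvision and permute yourself.
video = torch.randn(1, 48, 3, 384, 512, device=device)

with torch.no_grad():
    # grid_size=10 asks the model to track a 10x10 grid of points from frame 0.
    pred_tracks, pred_visibility = cotracker(video, grid_size=10)

print(pred_tracks.shape)      # typically (B, T, N, 2): x, y per point per frame
print(pred_visibility.shape)  # per-point visibility over time
```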