Editor’s note – This is the second article in a two-part series:

Part 2 – CoTracker3: A Point Tracker Using Real Videos (this post) – a hands-on tutorial on how to run inference with the model and parse the output into FiftyOne format
Part 1 – CoTracker3: Enhanced Point Tracking with Less Data – a primer on point tracking and CoTracker3

How to make sense of the model outputs and parse them into FiftyOne format

CoTracker3 is designed to track individual points throughout a video sequence. Given a video and the initial location of a point in a specific frame, it predicts that point’s trajectory over time, even when the point is occluded or moves out of the camera’s view.

What makes CoTracker3 stand out from other point trackers is its ability to effectively leverage real-world videos during training, resulting in SOTA performance on the point tracking task.

Most SOTA point trackers rely heavily on synthetic datasets for training due to the difficulty of annotating real-world videos. CoTracker3 overcomes this limitation using a semi-supervised training approach incorporating unlabeled real-world videos. This is achieved by employing multiple existing point trackers (trained on synthetic data) as “teachers” to generate pseudo-labels for the unlabeled videos.

CoTracker3, as the “student,” then learns from these pseudo-labels, effectively bridging the gap between synthetic and real-world data distributions. This strategy allows CoTracker3 to achieve SOTA accuracy on benchmark datasets while being trained on less real-world training data than previous methods.

I highly recommend reading the paper if you’re interested in all the nitty-gritty details. In this post, we’re going hands-on. I’ll show you how to run inference with the model and parse the output into FiftyOne format.

👨🏽‍💻 Let’s code!

Note: you can jump right into the Google Colab notebook or clone the repo and run it locally (assuming you have a GPU with enough memory).

Start off with installing the required libraries:

Let’s download a dataset. I’ve got one here on Hugging Face you can download:

You can do an initial exploration of the dataset using the FiftyOne app:

There are a couple of ways you can use the model. One is to clone the CoTracker GitHub repository and use the code there. The other is to load the model from Torch Hub. The model comes in two flavours: online and offline.

Online and Offline Modes in CoTracker3

Both online and offline versions of CoTracker3 have the same model architecture.

They differ in their training procedures and in how they use temporal information at inference: how they process the input video and the direction in which they track points.

CoTracker3 Online: Processes the video sequentially in a sliding window. It tracks points forward-only, making predictions based on previously seen frames. This mode enables real-time tracking for an indefinite duration, limited only by computational resources.

CoTracker3 Offline: Processes the entire video at once as a single window. This allows it to track points bidirectionally, leveraging information from both past and future frames. The offline version performs better, especially when tracking occluded points, because it can interpolate trajectories through occlusions using the entire video context.

However, unlike the online version, the maximum number of frames it can process is limited by memory.

Online vs offline mode for CoTracker3
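To make the difference concrete, here is a minimal sketch of how the two modes carve up a video into frame ranges. The helper name sliding_windows and the window length and stride (16 and 8) are illustrative assumptions on my part, not values taken from the model.

```python
def sliding_windows(num_frames, window_len, stride):
    """Return (start, end) frame ranges for overlapping windows (end exclusive)."""
    if window_len <= 0 or stride <= 0:
        raise ValueError("window_len and stride must be positive")
    if num_frames <= window_len:
        return [(0, num_frames)]
    windows = []
    start = 0
    # Emit full windows until the next one would reach the end of the video.
    while start + window_len < num_frames:
        windows.append((start, start + window_len))
        start += stride
    # The last window is clipped so it never runs past the final frame.
    windows.append((start, min(start + window_len, num_frames)))
    return windows


num_frames = 40

# Online-style processing: overlapping windows, visited in order (forward only).
online_windows = sliding_windows(num_frames, window_len=16, stride=8)

# Offline-style processing: one window spanning the whole video.
offline_windows = [(0, num_frames)]

print(online_windows)
print(offline_windows)
```

The online list grows with the video length while each window stays the same size, which is why that mode can keep going on long videos. The single offline window has to hold every frame at once, which is where the memory limit comes from.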

I’ll use the offline mode for this tutorial. You can download the model like so:

Let’s prepare a video for inference using the model.

From Pexels.com

And we can load this video as a tensor:

The model accepts a tensor with the following shape: B T C H W. Where B is the batch size, T is the number of frames, C is the number of input channels, H is the video height and W is the video width.

Let’s confirm the shape of the tensor:
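Here is a minimal sketch of that layout, using NumPy and a dummy array in place of a real decoded video so that it runs anywhere. The frame count and resolution are made-up values. With PyTorch, the equivalent rearrangement would be along the lines of permute(0, 3, 1, 2)[None].float() on the frames tensor.

```python
import numpy as np

# Stand-in for decoded video frames: T frames of H x W RGB pixels (uint8).
# A real video would come from a decoder; zeros keep the sketch self-contained.
num_frames, height, width = 8, 48, 64
frames = np.zeros((num_frames, height, width, 3), dtype=np.uint8)

# (T, H, W, C) -> (T, C, H, W), then add a leading batch axis -> (B, T, C, H, W).
video = np.transpose(frames, (0, 3, 1, 2))[None].astype(np.float32)

print(video.shape)  # (1, 8, 3, 48, 64)
```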

Before running inference, let’s move the model and video to the GPU:

The grid_size parameter in CoTracker3 determines the number of points in a grid that will be tracked within a video frame. It lets you specify the density of tracked points across the frame when you don't have specific points you want to track.

When grid_size is greater than 0, the model computes tracks for a grid of points on the first frame. The grid will have grid_size * grid_size points.
This parameter is used in conjunction with queries and segm_mask.
If queries is not provided, and grid_size is set, the model will track points on a regular grid.
If a segm_mask is also provided, the grid points are computed only for the masked area.
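To get a feel for what the grid looks like, here is a rough NumPy sketch of a regular grid and of restricting it to a mask. The functions regular_grid and keep_masked are hypothetical helpers I wrote for illustration. The library's own grid helper may place points differently (for example, with a margin from the frame border).

```python
import numpy as np


def regular_grid(grid_size, height, width):
    """Return a (grid_size * grid_size, 2) array of (x, y) pixel coordinates."""
    xs = np.linspace(0, width - 1, grid_size)
    ys = np.linspace(0, height - 1, grid_size)
    grid_x, grid_y = np.meshgrid(xs, ys)  # each has shape (grid_size, grid_size)
    return np.stack([grid_x.ravel(), grid_y.ravel()], axis=1)


def keep_masked(points, mask):
    """Keep only the points that land on non-zero pixels of an (H, W) mask."""
    cols = np.round(points[:, 0]).astype(int)
    rows = np.round(points[:, 1]).astype(int)
    return points[mask[rows, cols] > 0]


height, width = 480, 640
points = regular_grid(20, height, width)

# Toy mask covering the left half of the frame.
mask = np.zeros((height, width), dtype=np.uint8)
mask[:, : width // 2] = 1
masked_points = keep_masked(points, mask)

print(points.shape)         # (400, 2)
print(masked_points.shape)  # (200, 2)
```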

Larger grid_size values consume more GPU memory.

The grid_size parameter determines how many points the model tracks, and that number grows quadratically: doubling grid_size quadruples the number of tracked points. Every extra point adds per-point state that the model has to store and update in every frame, so the memory needed for intermediate results grows with it. This holds even under torch.no_grad(), which only avoids storing gradients. If you run into out-of-memory errors, consider reducing grid_size.

We can run inference as follows:
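A minimal sketch of loading the offline model and running it on a grid is below. The wrapper run_offline_tracking is a name I made up for this example, and the Torch Hub repository and entry point names are as I recall them from the CoTracker README, so double-check them against the repo. The first call needs network access to download the code and weights.

```python
def run_offline_tracking(video, grid_size=10, device=None):
    """Track a regular grid of points through `video` with CoTracker3 offline.

    `video` is a float tensor of shape (B, T, C, H, W).
    Returns the tuple (pred_tracks, pred_visibility).
    """
    # Imported lazily so the sketch can be loaded on machines without PyTorch.
    import torch

    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    # Assumed hub names; downloads the repository and weights on first use.
    model = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline")
    model = model.to(device).eval()

    with torch.no_grad():  # inference only, so no gradients are stored
        pred_tracks, pred_visibility = model(video.to(device), grid_size=grid_size)
    return pred_tracks, pred_visibility
```

Calling run_offline_tracking(video, grid_size=10) on a tensor shaped B T C H W returns the two output tensors. Start with a small grid_size and increase it once you know how much memory your videos need.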

Examining the model output

The model returns two tensors. Let’s start by examining pred_tracks:

The pred_tracks tensor has the following shape: B T N 2

Where:

• B (Batch Size): This dimension represents the batch size.

• T (Time/Frames): This dimension corresponds to the number of frames in the video. It matches the number of frames in the input video tensor.

• N (Number of Points): This dimension represents the number of points being tracked. The number of points is determined by the grid_size parameter specifying how many points are sampled on a regular grid in the first frame. For example, if grid_size=20, then N would be 20x20 = 400.

• 2 (Coordinates): This dimension represents the x and y coordinates of each tracked point in the frame. Each point has two values corresponding to its position in the frame.

Now, let’s examine the pred_visibility tensor:

The pred_visibility tensor has shape: B T N 1

Where:

• B (Batch Size): Same as above, representing the batch size.

• T (Time/Frames): Same as above, representing the number of frames.

• N (Number of Points): Same as above, representing the number of points being tracked.

• 1 (Visibility): This dimension represents the visibility of each tracked point. It is a binary indicator for whether a point is visible in the frame.
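If the indexing feels abstract, this small sketch may help. It uses dummy NumPy arrays with the shapes described above rather than real model output, and all the values in it are invented for illustration.

```python
import numpy as np

# Dummy outputs with the shapes described above (invented values, not model output).
B, T, N = 1, 5, 4
pred_tracks = np.zeros((B, T, N, 2), dtype=np.float32)
pred_visibility = np.ones((B, T, N, 1), dtype=bool)

# Make point 2 drift right by 3 px per frame and become hidden from frame 3 on.
pred_tracks[0, :, 2, 0] = 10 + 3 * np.arange(T)
pred_tracks[0, :, 2, 1] = 20
pred_visibility[0, 3:, 2, 0] = False

# Trajectory of point 2 in the first video: one (x, y) pair per frame.
trajectory = pred_tracks[0, :, 2]  # shape (T, 2)

# Frames in which point 2 is visible.
visible_frames = np.flatnonzero(pred_visibility[0, :, 2, 0])

# Number of visible points in each frame.
visible_per_frame = pred_visibility[0, ..., 0].sum(axis=1)  # shape (T,)

print(trajectory[:, 0])   # x coordinates over time
print(visible_frames)
print(visible_per_frame)
```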

We can use the code from the CoTracker repository to do some visualization:

You’ll notice that as the camera pans, the visibility of the points changes.

Using CoTracker3 with FiftyOne

Now that we understand how the model works, we can apply it to our FiftyOne dataset and use the FiftyOne app to visualize the output. First, let’s clean up as much GPU memory as we can:

We’ll clone a repo I created to accompany this blog post:

Before running inference on the dataset, I want to discuss this codebase's workhorse: the function parsing model outputs into FiftyOne format.

Parsing CoTracker Output to FiftyOne Keypoints

I won’t go through the entire codebase with you, as the main inference code is the same. What I do want to touch on is how to parse the model output (i.e., pred_tracks and pred_visibility) into FiftyOne format.

Let's take a look at the following function:

The create_keypoints_batch function takes the output from CoTracker and converts it into FiftyOne's keypoint format.

Here’s how it works:

Input:

results: A list of tuples, each containing:
pred_tracks: Predicted tracks for each point across all frames
pred_visibility: Visibility of each point across all frames
samples: A list of FiftyOne samples (videos)

Processing:

For each sample (video) in the batch:

1. Normalize the coordinates:

CoTracker outputs pixel coordinates
These are normalized to [0, 1] range by dividing by frame width/height

2. For each frame in the video:

Create a list of fo.Keypoint objects
Each Keypoint represents a tracked point that is visible in the frame
The points attribute of Keypoint is a list with a single (x, y) tuple
The index attribute is set to the point's index in the tracking sequence

3. Create an fo.Keypoints object for each frame, containing all visible keypoints.

Output:

Each sample (video) in FiftyOne is updated with a “tracked_keypoints” field
This field contains an fo.Keypoints object for each frame

Key Points:

We don’t use the confidence attribute as CoTracker does not provide it
The label attribute is not used in this implementation
Visibility is binary (0 or 1) in the CoTracker output
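The core of that logic can be sketched without FiftyOne at all. The function below is my own simplified stand-in, not the repo's create_keypoints_batch: it handles a single video and returns plain dictionaries instead of fo.Keypoint objects, with each dictionary holding the points and index values a Keypoint would get. Frame numbers start at 1 because FiftyOne video frames are 1-indexed.

```python
import numpy as np


def tracks_to_frame_keypoints(pred_tracks, pred_visibility, width, height):
    """Convert one video's tracks into per-frame lists of visible keypoints.

    pred_tracks:     (T, N, 2) pixel coordinates (x, y)
    pred_visibility: (T, N) or (T, N, 1) visibility flags
    Returns a dict mapping 1-based frame number to a list of
    {"index": point_index, "points": [(x_norm, y_norm)]} entries.
    """
    tracks = np.asarray(pred_tracks, dtype=np.float64)
    visible = np.asarray(pred_visibility).reshape(tracks.shape[:2]).astype(bool)

    # Normalize pixel coordinates to [0, 1] by dividing by frame width / height.
    normalized = tracks / np.array([width, height], dtype=np.float64)

    frames = {}
    for t in range(tracks.shape[0]):
        # Only points flagged as visible in this frame are kept.
        frames[t + 1] = [
            {"index": n, "points": [(float(x), float(y))]}
            for n, (x, y) in enumerate(normalized[t])
            if visible[t, n]
        ]
    return frames
```

In the real pipeline, each dictionary would become an fo.Keypoint and each per-frame list would be wrapped in an fo.Keypoints object stored under the "tracked_keypoints" field.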

Now, let’s run inference on the whole dataset. Note that this will take ~1-2 minutes (assuming you’re using the A100 on Google Colab; when I ran this on my RTX 6000 Ada, it took ~3 minutes).

Setting some configurations for the FiftyOne app:

Output from my local machine using a grid_size of 50

Some lessons I learned during this project

This is the first point tracking model I’ve ever used, and throughout this process, I learned a lot about working with video data and how much GPU memory it consumes!

The dataset I was playing around with consisted of short videos I downloaded for free from Pexels. The original files were relatively small (about 5 MB or so). Still, when I converted them to PyTorch tensors, they often took up several gigabytes of GPU memory. This is because a decoded video is stored uncompressed, frame by frame, and the videos had a high frame rate: a 5-second video at 30 fps is 150 frames. On top of that, the output tensors (pred_tracks and pred_visibility) often took up several gigabytes as well.
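A quick back-of-the-envelope calculation shows how a small file turns into gigabytes. The sketch assumes a 1920x1080 source and float32 values (4 bytes each). The resolution is my assumption, since I haven't listed the exact resolutions of the videos here.

```python
def tensor_gib(num_frames, height, width, channels=3, bytes_per_value=4):
    """Size in GiB of a dense (T, C, H, W) tensor; float32 uses 4 bytes per value."""
    return num_frames * channels * height * width * bytes_per_value / 2**30


fps, seconds = 30, 5
num_frames = fps * seconds  # 150 frames

# Assumed 1920x1080 source video held as float32.
full_rate = tensor_gib(num_frames, height=1080, width=1920)

# The same clip if only 10 frames per second are kept (50 frames).
reduced_rate = tensor_gib(10 * seconds, height=1080, width=1920)

print(f"{full_rate:.2f} GiB")     # 3.48 GiB
print(f"{reduced_rate:.2f} GiB")  # 1.16 GiB
```

That is the input tensor alone, before the model allocates anything for its own intermediate results.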

To overcome this challenge, I wrote a script to preprocess my videos. For videos longer than 10 seconds, the script samples every third frame and reduces the frame rate to 7 frames per second (fps), which helps decrease the file size and processing load while maintaining smooth playback. For videos that are 10 seconds or shorter, the script reduces the frame rate to 10 fps without additional frame sampling.
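The rule itself is simple enough to sketch in a few lines. This is not my actual preprocessing script, just one hypothetical way to express the same decision, plus a helper showing which source frames survive when a clip is brought down to a lower frame rate. Both function names are invented for this example.

```python
def preprocessing_plan(duration_s):
    """Return (frame_stride, target_fps) following the rule described above."""
    if duration_s > 10:
        return 3, 7   # keep every third frame, write the result at 7 fps
    return 1, 10      # no extra frame sampling, write the result at 10 fps


def resample_indices(num_frames, source_fps, target_fps):
    """Indices of source frames to keep so the clip plays at target_fps."""
    if target_fps >= source_fps:
        return list(range(num_frames))
    step = source_fps / target_fps
    count = int(num_frames / step)
    return [int(i * step) for i in range(count)]


print(preprocessing_plan(12))  # (3, 7)
print(preprocessing_plan(5))   # (1, 10)

# A 5-second clip at 30 fps brought down to 10 fps keeps 50 of its 150 frames.
kept = resample_indices(150, source_fps=30, target_fps=10)
print(len(kept), kept[:4])     # 50 [0, 3, 6, 9]
```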

That may have been more reduction than necessary, but it allowed me to run inference on my whole dataset (on my local GPU, which has 48GB RAM) without blowing up GPU RAM while maintaining a relatively large grid_size of 50 (we already discussed the impact of grid_size on GPU memory above).

Note that the online version of the model is more memory efficient, but I wasn’t as happy with the results.

Next Steps

Let me know if you are interested in this model and want to see more about it.

A v2 of this post could cover a couple of things, such as using segmentation masks and query points. For example, one could run a zero-shot segmentation model across all the videos, get the segmentation masks, and use them to track the objects of interest.

If you enjoyed this post and want to discuss it further, join our Discord server!
