D4RT: Inside CVPR 2026's Best Paper on 4D Scene Reconstruction

D4RT (Dynamic 4D Reconstruction and Tracking) is a 4D scene reconstruction model from Google DeepMind, University College London, and the University of Oxford. It won a best paper award at CVPR 2026 by replacing an entire pipeline of specialized models with a single query interface that recovers the geometry and motion of a dynamic scene from video.

We built a notebook with the open source FiftyOne toolkit to make that idea tangible. It loads real street footage, layers simulated depth heatmaps and point tracks directly onto video frames, and turns the paper's core claim, that dynamic objects are no longer a special case, into something you can watch rather than just read. This post explains the paper, what makes it groundbreaking, and what the FiftyOne notebook demonstrates that a static paper figure cannot.

Key takeaways

D4RT (Dynamic 4D Reconstruction and Tracking) is a unified, efficient, feedforward model from Google DeepMind, University College London, and the University of Oxford that won the best paper award at CVPR 2026.
It replaces the traditional multi-model 4D reconstruction pipeline (separate depth, optical flow, and camera-pose models) with a single query interface, encoding a video once and answering depth, point-tracking, and camera-pose queries from the same latent representation.
Its core conceptual contribution is treating dynamic objects the same way as static ones, with no special case, no test-time optimization, and no fusion step.
D4RT sets a new state of the art across every 4D reconstruction benchmark it was tested on, including scenes where prior methods like VGGT fail to track moving objects.
The model weights are not yet publicly released. The FiftyOne notebook in this post uses physically plausible simulated outputs grounded in real detections, and becomes a true reproduction tool with a one-function change once weights ship.

Congratulations to the D4RT Team

Last week at CVPR 2026, the most competitive conference in computer vision, the awards committee reviewed over 16,000 submitted papers and selected two for top honors. One of them was "Efficiently Reconstructing Dynamic Scenes One D4RT at a Time" by Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, and Mehdi Sajjadi from Google DeepMind, University College London, and the University of Oxford.

It is a richly deserved recognition. This is the kind of paper that earns a best paper award not just by moving numbers on a leaderboard — though it does that too — but by making the field ask why it took this long to think about the problem this way.

What is D4RT and how does it work?

4D scene reconstruction — recovering the geometry and motion of a dynamic environment from video — has historically been treated as a pipeline problem. You assemble a chain of specialized models: one for depth, one for optical flow, one for camera pose estimation. You fuse their outputs with optimization steps designed to enforce geometric consistency across what are fundamentally incompatible representations. The result is slow, brittle, and tends to fail precisely where it matters most: on moving objects.

D4RT replaces the entire pipeline with a single question.

The model encodes a video once into a compact latent scene representation. You then query that representation with a five-tuple — a pixel location (u, v), a source timestamp t_src, a target timestamp t_tgt, and a camera reference frame t_cam — and the model returns a 3D position. That's the complete interface. By varying which timestamps you provide:

Set all three timestamps equal → depth map for that frame
Fix (u, v, t_src) and vary t_tgt → point track for that pixel across time
Vary t_cam → camera parameters for each frame
Query every pixel → dense point cloud

One architecture. One forward pass. No test-time optimization. No separate fusion step.
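To make the query formulation concrete, here is a minimal sketch of how the five-tuple and two of the query patterns above could be written down in Python. The Query type and both helper functions are hypothetical names invented for illustration; the D4RT code is not public, so nothing here reflects the authors' actual API. Keeping t_cam fixed at t_src for tracks is also an assumption, since the pattern above only says which values vary.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Query:
    """One D4RT-style query (hypothetical type, not the authors' API)."""
    u: float     # pixel column
    v: float     # pixel row
    t_src: int   # frame in which pixel (u, v) is observed
    t_tgt: int   # time at which the 3D position is requested
    t_cam: int   # frame whose camera defines the output coordinates


def depth_queries(width: int, height: int, t: int) -> List[Query]:
    """All three timestamps equal: one query per pixel describes a depth map."""
    return [Query(u, v, t, t, t) for v in range(height) for u in range(width)]


def track_queries(u: float, v: float, t_src: int, num_frames: int) -> List[Query]:
    """Fix (u, v, t_src) and sweep t_tgt: one query per frame describes a track.

    Assumption: t_cam stays at t_src so the whole track shares one reference frame.
    """
    return [Query(u, v, t_src, t_tgt, t_src) for t_tgt in range(num_frames)]
```

The point of the sketch is that a depth map and a point track are the same kind of object, a list of queries, and differ only in which fields are held fixed.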

Why D4RT won CVPR 2026's best paper

The performance numbers are strong — D4RT sets a new state of the art across every 4D reconstruction benchmark it was tested on. But the CVPR committee doesn't give best paper awards purely for benchmark wins. The contribution here is conceptual as much as empirical.

D4RT refuses to treat dynamic objects as a special case. Prior leading methods either skip moving objects entirely when establishing correspondences — VGGT is a notable example — or handle them through expensive iterative refinement that compounds latency and error. D4RT applies the same query, the same decoder, and the same forward pass to a pedestrian crossing the street and the building behind them. The architecture makes no distinction. That uniformity is the insight.

The query interface decouples what you ask from how much you decode. Traditional dense reconstruction methods compute outputs for every pixel in every frame whether you need them or not. D4RT's on-demand querying means you can ask about a handful of tracked points at inference time and pay only for those queries. Sparse or dense, static or dynamic — the same model handles all of it through the same interface.
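A quick back-of-the-envelope calculation shows how large the gap between dense and sparse querying can be. The clip size and point count below are assumptions chosen for illustration, and the number of queries is only a rough proxy for decoding cost, not a measured runtime.

```python
def dense_query_count(width: int, height: int, num_frames: int) -> int:
    """Queries needed to decode every pixel in every frame."""
    return width * height * num_frames


def sparse_query_count(num_points: int, num_frames: int) -> int:
    """Queries needed to follow a handful of points through every frame."""
    return num_points * num_frames


# Assumed clip: 640x360 pixels, 48 frames, 32 tracked points.
dense = dense_query_count(640, 360, 48)
sparse = sparse_query_count(32, 48)
print(dense, sparse, dense // sparse)  # 11059200 1536 7200
```

Under these assumptions, tracking 32 points issues 7,200 times fewer queries than decoding the full clip.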

The unified representation makes implicit things explicit. Because depth, correspondence, and camera pose all flow from the same latent scene encoding, they are by construction geometrically consistent with each other. There is no fusion step because there is nothing to fuse. This is an architectural property that pipelines assembled from separate models cannot replicate.

Exploring D4RT outputs in FiftyOne

Before describing what the notebook does, it is important to be clear about what it does not do.

The D4RT model weights have not yet been publicly released. The notebook contains no D4RT inference code. Every depth map, point track, camera parameter, and motion mask in the notebook is produced by a simulation — a few dozen lines of NumPy math that generates physically plausible outputs grounded in the real COCO detections present in the footage, based on reading the paper's descriptions of what D4RT produces.
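To give a sense of what that kind of simulation looks like, here is a minimal sketch of a depth map grounded in detection boxes. It is not the notebook's code. The function name, the depth values, and the top-to-bottom background gradient are all assumptions; the only convention taken from FiftyOne is that boxes are (x, y, width, height) in relative [0, 1] coordinates.

```python
import numpy as np


def simulate_depth(height, width, boxes, near=0.2, far_top=1.0, far_bottom=0.5):
    """Return a (height, width) float32 depth map with values in [0, 1].

    boxes: iterable of (x, y, w, h) in relative coordinates, top-left origin.
    """
    # Background: far at the top of the frame, nearer toward the bottom.
    column = np.linspace(far_top, far_bottom, height, dtype=np.float32)
    depth = np.repeat(column[:, None], width, axis=1)

    # Each detected object is painted as a flat patch at the "near" depth.
    for x, y, w, h in boxes:
        x0, y0 = int(round(x * width)), int(round(y * height))
        x1, y1 = int(round((x + w) * width)), int(round((y + h) * height))
        depth[max(y0, 0):min(y1, height), max(x0, 0):min(x1, width)] = near
    return depth
```

Because the object patches follow real detections, the simulated depth moves with the cars and pedestrians in the footage even though no model produced it.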

The notebook's purpose is to be a reading companion. Its goal is to give someone engaging with the paper a set of interactive visual anchors for the concepts being described — to make the paper's output domain feel tangible before the model itself is available to run. Think of it the way a well-labelled diagram helps you understand an algorithm: not by running the algorithm, but by making its structure visible in a way that prose cannot.

When the model weights are released, swapping the simulation functions for real inference is a one-function change. At that point the notebook becomes a genuine reproduction tool. Until then it is an honest illustration, and we think that is still worth having.

How each D4RT capability maps to its FiftyOne representation and key insight

D4RT Concept	FiftyOne Representation	Key Insight
Unified query interface	Single dataset, multiple label types per frame	One model -> depth + tracks + cameras
Feedforward depth decoding	fo.Heatmap per frame	Objects form near-depth spike
Dynamic point tracking	fo.Keypoints grounded in COCO detections	10x more displacement than static
Static/dynamic separation	fo.Segmentation motion mask	Emerges from depth, no seg model needed
Method comparison	Per-clip AJ/AbsRel fields + scatter plot	Gap grows with dynamic content
Camera recovery	Per-frame rotation/translation + 3D plots	No test-time optimisation required

This table maps the six core capabilities of D4RT to how each one is represented in the FiftyOne companion notebook and the insight it reveals. D4RT's unified query interface becomes a single dataset carrying depth heatmaps, dynamic and static point tracks, motion masks, and per-frame camera parameters on the same video frames, so depth estimation, point tracking, and camera recovery all come from one model with no test-time optimization. The notebook's comparison fields make the headline finding visible, which is that D4RT's advantage over baseline methods like VGGT widens as scenes become more dynamic.

How to explore D4RT's depth, tracks, and camera poses in FiftyOne

As mentioned, FiftyOne is an open-source computer vision toolkit built around a simple idea: model outputs should be explorable, not just measurable. It is particularly well suited to a paper like D4RT, where the most important claims are spatial and temporal — things that live in video frames, not in summary statistics.

Here is what the notebook demonstrates that a static paper cannot.

The unified interface as something you experience

The paper describes D4RT's query formulation in a paragraph and a diagram. In FiftyOne, depth maps, point tracks, motion masks, and camera parameters all live as separate toggleable fields on the same video frames in the same viewer. Enabling and disabling frames.d4rt_depth, frames.d4rt_dynamic_tracks, frames.d4rt_static_tracks, and frames.d4rt_motion_mask from the same sidebar panel — on the same clip, without switching models or views — is the most direct way to feel the unification argument the paper is making. The outputs co-exist because they were co-produced. That relationship is invisible in a table and visible the moment you open a clip.
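For readers who want to see how overlays like these end up as toggleable fields, here is a hedged sketch of writing them onto a single video frame. The field names come from the paragraph above; the helper names, the 16-bit quantization, and the label strings are assumptions and may differ from the notebook. The FiftyOne import is deferred into the function so the NumPy helper runs without FiftyOne installed.

```python
import numpy as np


def to_uint16(depth):
    """Quantize a float depth map in [0, 1] to 16-bit integers."""
    return np.round(np.clip(depth, 0.0, 1.0) * 65535).astype(np.uint16)


def attach_frame_labels(frame, depth, mask, dynamic_pts, static_pts):
    """Write four overlay fields onto one FiftyOne video frame.

    dynamic_pts and static_pts are lists of (x, y) tuples in relative
    [0, 1] coordinates; mask is a 2D integer array.
    """
    # Imported here so to_uint16 works even without FiftyOne installed.
    import fiftyone as fo

    frame["d4rt_depth"] = fo.Heatmap(map=to_uint16(depth), range=[0, 65535])
    frame["d4rt_motion_mask"] = fo.Segmentation(mask=mask)
    frame["d4rt_dynamic_tracks"] = fo.Keypoints(
        keypoints=[fo.Keypoint(label="dynamic", points=dynamic_pts)]
    )
    frame["d4rt_static_tracks"] = fo.Keypoints(
        keypoints=[fo.Keypoint(label="static", points=static_pts)]
    )
```

In a video dataset, the frame would come from sample.frames[frame_number], where frame numbers start at 1, and sample.save() persists the new fields.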

Dynamic and static tracks, simultaneously, on real footage

The paper's headline result — that D4RT establishes correspondences through moving objects where prior methods cannot — is reported as an AJ score difference. In FiftyOne it becomes something you watch. Cyan keypoints follow the bounding-box centres of real detected cars and pedestrians across frames, showing substantial displacement. Yellow keypoints on background points drift only slightly, following the slow camera pan. Both are visible at the same time, on the same real street footage, in the same video timeline. The difference in displacement is not a metric you have to interpret — it is directly visible. VGGT's equivalent output would contain no cyan points at all. That absence is not something a table communicates.
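The visual difference between the two sets of keypoints can also be put into a number. The sketch below computes the mean per-frame step length of a track. The two example tracks are made-up illustrative values, not outputs from the notebook or from D4RT.

```python
import numpy as np


def mean_displacement(track):
    """Mean per-frame step length of a (T, 2) array of (x, y) positions."""
    steps = np.diff(np.asarray(track, dtype=float), axis=0)
    return float(np.linalg.norm(steps, axis=1).mean())


# Illustrative tracks in relative coordinates over 10 frames.
frames = np.arange(10)
dynamic_track = np.stack([0.20 + 0.02 * frames, np.full(10, 0.5)], axis=1)
static_track = np.stack([0.70 + 0.002 * frames, np.full(10, 0.3)], axis=1)

ratio = mean_displacement(dynamic_track) / mean_displacement(static_track)
print(round(ratio, 2))
```

With these made-up tracks the dynamic point moves about ten times as far per frame as the static one, which is the kind of gap the cyan and yellow keypoints make visible.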

Depth as implicit segmentation, verified in three layers

The paper does not claim D4RT is a segmentation model. But a natural consequence of unified depth estimation is that near objects cluster in a distinct depth band. FiftyOne makes this a first-class observation you can verify: enable frames.d4rt_depth to see the inferno colorscale heatmap where moving objects appear as dark patches against a bright far-background gradient; enable frames.d4rt_motion_mask and watch a red overlay sit precisely over those same dark regions; keep frames.detections active and see the original COCO bounding box confirming an actual object sits there. The mask is simply a threshold on the depth values already shown in the heatmap. No additional model. No additional inference. Toggle the mask off and on over the heatmap and the threshold relationship becomes undeniable.
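The threshold relationship is simple enough to sketch in a few lines. The threshold value and function names below are assumptions for illustration; the only thing taken from the text is that the mask is a threshold on the depth values, with near pixels flagged.

```python
import numpy as np


def motion_mask(depth, threshold=0.35):
    """Flag pixels nearer than the threshold with 1, everything else with 0."""
    return (np.asarray(depth) < threshold).astype(np.uint8)


def coverage_pct(mask):
    """Percentage of pixels flagged in the mask."""
    return 100.0 * float(np.asarray(mask).mean())
```

A per-clip figure such as dynamic_coverage_pct could plausibly be derived by averaging coverage_pct over the frames of a clip, though that is a guess about the notebook rather than a description of it.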

The performance gap as a filter, not a footnote

Every practitioner reading a benchmark table asks the same question: where does the method win, and what do those scenes actually look like? FiftyOne answers it directly. The d4rt_aj_advantage field is stored at the sample level, so you can sort clips by it and immediately navigate to the scenes where the gap between D4RT and the baselines is widest. Those clips turn out to be the busiest — highest mean_dets_per_frame, highest dynamic_coverage_pct. The scatter plot in the notebook shows this correlation explicitly. The filter in the App makes it navigable. You are not reading about a relationship between scene dynamics and model performance; you are exploring it on real footage frame by frame.

Camera recovery made concrete

The 3D trajectory plots show D4RT's recovered camera pose across all ten clips simultaneously, with forward-direction arrows rendered per frame. The camera motion versus dynamic coverage scatter plot lets you verify that D4RT is disentangling camera motion from object motion — a known failure mode of simpler methods is to conflate the two, which would show up as clips with high object coverage also having anomalously high camera motion estimates. Looking at the scatter, you can check for that pattern without writing a single line of evaluation code.
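If you would rather check that pattern numerically than by eye, a minimal sketch is to measure how far the camera travels in each clip and correlate that with dynamic coverage. Both function names are hypothetical, and the inputs are whatever per-frame translations and per-clip coverage values you have stored.

```python
import numpy as np


def camera_path_length(translations):
    """Total distance travelled along a (T, 3) sequence of camera positions."""
    steps = np.diff(np.asarray(translations, dtype=float), axis=0)
    return float(np.linalg.norm(steps, axis=1).sum())


def motion_coverage_correlation(path_lengths, coverage_pcts):
    """Pearson correlation between per-clip camera motion and dynamic coverage."""
    return float(np.corrcoef(path_lengths, coverage_pcts)[0, 1])
```

A correlation near zero would be consistent with camera motion and object motion being disentangled, while a strong positive value would suggest the conflation described above. With only ten clips, treat either reading as a hint rather than a result.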

What the notebook accomplishes

The notebook runs on FiftyOne's quickstart-video dataset — 10 real urban street clips, CC-BY-4.0, ~50 MB, one-line download — and builds in eleven steps from raw footage to a fully labelled interactive dataset.

It attaches to every sample:

16-bit depth heatmaps rendered with the inferno colorscale
Binary motion masks derived from the depth threshold
Cyan dynamic keypoints tracking real COCO-detected objects across frames
Yellow static keypoints anchored to background points
Per-frame camera translation and rotation parameters
Per-clip AJ and AbsRel metrics for D4RT, SpatialTrackerV2, and VGGT

It produces inline visualisations:

RGB-versus-depth grids showing the near/far gradient on real footage
Depth histograms comparing static and dynamic clips
Track trail renderings showing cyan and yellow displacement side by side
Method comparison bar charts and a dynamic-content-versus-advantage scatter plot
3D camera trajectory plots for all ten clips simultaneously

It launches the FiftyOne App with:

Summary fields creating range sliders for dynamic_coverage_pct, camera_tx, and camera_tz in the sidebar
All label fields toggleable from a single sidebar panel on the same video frames

Next steps

When D4RT weights are released, the swap from simulation to real inference is isolated to the four simulate_* functions in Cell 13. Everything else — the FiftyOne dataset schema, the label fields, the App colour scheme, the sidebar filters — stays exactly as it is. The companion becomes a reproduction tool with a one-function change.
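One way to keep a swap like this small is to route every call through a single dispatch point. The sketch below illustrates that pattern and is not the notebook's code: simulate_depth here is a placeholder gradient, and DEPTH_BACKENDS and predict_depth are hypothetical names.

```python
from typing import Callable, Dict

import numpy as np

DepthFn = Callable[[np.ndarray], np.ndarray]


def simulate_depth(frame: np.ndarray) -> np.ndarray:
    """Placeholder: a far-to-near vertical gradient the size of the frame."""
    height, width = frame.shape[:2]
    column = np.linspace(1.0, 0.5, height, dtype=np.float32)
    return np.repeat(column[:, None], width, axis=1)


# Registry of depth providers; only the simulated one exists today.
DEPTH_BACKENDS: Dict[str, DepthFn] = {"simulated": simulate_depth}


def predict_depth(frame: np.ndarray, backend: str = "simulated") -> np.ndarray:
    """Single entry point the rest of the pipeline calls for depth."""
    return DEPTH_BACKENDS[backend](frame)
```

Once a real inference function exists, registering it under a new key and changing the default backend would leave the dataset schema and label-writing code untouched.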

To go deeper on the benchmark, download PointOdyssey or Sintel and load them into FiftyOne using the same label schema. These are the actual datasets D4RT is evaluated on in the paper. With real model outputs attached, you can reproduce Tables 2 and 3 interactively — filtering by scene complexity, occlusion level, and motion magnitude rather than reading aggregate numbers.

To extend the comparison, the spatialtracker_aj and vggt_aj fields are already on every sample. When SpatialTrackerV2 and VGGT outputs are available on the same clips, plugging them in as additional keypoint fields turns the bar charts into a live side-by-side comparison in the App viewer — you can watch all three methods' tracks simultaneously on the same frame.

To use this as a teaching tool, the notebook is structured so each step is independently runnable. Steps 1–3 work without any D4RT knowledge. Steps 6–7 make good standalone demonstrations of what depth estimation and point tracking look like before getting into the paper's technical details. Step 11 gives a complete interactive dataset that can anchor a reading group or lecture without requiring anyone to have run the paper's code.

The full notebook is available here. The paper is at arXiv:2512.08924. The dataset used is FiftyOne's quickstart-video, loadable with a single line of code.

Try it yourself. FiftyOne is the open source toolkit that turns model outputs into something you can explore frame by frame, not just measure. Clone the D4RT companion notebook, run it on the quickstart-video dataset in one line, and toggle depth, tracks, and motion masks on real footage. When the D4RT weights drop, you will be one function away from reproducing the paper's results interactively. Get started with FiftyOne.

FAQ

What is D4RT? D4RT (Dynamic 4D Reconstruction and Tracking) is a 4D scene reconstruction model from Google DeepMind, University College London, and the University of Oxford. It recovers the geometry and motion of a dynamic scene from video using a single query interface, and it won the best paper award at CVPR 2026.

What does D4RT stand for? D4RT stands for Dynamic 4D Reconstruction and Tracking. It is a unified, efficient, feedforward method, and the name is a play on "4D" and "dart," reflected in the paper title "Efficiently Reconstructing Dynamic Scenes One D4RT at a Time."

What was the best paper at CVPR 2026? One of the two best papers at CVPR 2026 was "Efficiently Reconstructing Dynamic Scenes One D4RT at a Time," selected from over 16,000 submissions. It introduces D4RT, a single-model approach to dynamic 4D reconstruction.

How is D4RT different from VGGT? VGGT and similar methods either skip moving objects when establishing correspondences or handle them with expensive iterative refinement. D4RT applies the same query, decoder, and forward pass to dynamic and static content alike, so it can track correspondences through moving objects where VGGT cannot. D4RT also reports state-of-the-art results across the 4D reconstruction benchmarks it was tested on.

How does D4RT work? D4RT encodes a video once into a compact latent scene representation, then answers queries defined by a pixel location, a source timestamp, a target timestamp, and a camera reference frame, returning a 3D position. Varying which timestamps you provide yields depth maps, point tracks, camera parameters, or a dense point cloud, all from one model in a single forward pass with no test-time optimization.

Are the D4RT model weights available? Not yet. As of publication the weights have not been publicly released. The FiftyOne notebook in this post visualizes simulated outputs grounded in real detections so the paper's ideas are explorable today, and it is designed to swap to real inference with a one-function change when weights are released.

Citation

Zhang, C., Le Moing, G., Koppula, S., Rocco, I., Momeni, L., Xie, J., Sun, S., Sukthankar, R., Barral, J., Hadsell, R., Ghahramani, Z., Zisserman, A., Zhang, J., and Sajjadi, M. (2025). Efficiently Reconstructing Dynamic Scenes One D4RT at a Time. arXiv:2512.08924.
