Most computer vision models perceive motion — they tell you how things moved after the fact. MolmoMotion does something harder and more useful: it forecasts motion. Given a single frame, a few points marked on an object, and a plain-language instruction like "move and rotate the wooden bowl with fruit on the table," it predicts where those points will travel in 3D over the next few seconds. That's the difference between a system that narrates the past and one that can plan: a robot reaching for a cup needs to anticipate how the cup will move before it touches it, and a video generator needs to know what plausible motion comes next.
MolmoMotion, from Ai2, is open all the way down — model weights, the million-clip training set, and a purpose-built benchmark called PointMotionBench for measuring object-centric 3D motion forecasting. The benchmark is the part this post is about, because a benchmark you can't see is a benchmark you can't really reason about. PointMotionBench ships as a pile of .npz track files and JSON captions across three source datasets (DAVIS, HOT3D, WorldTrack). That's the right format for training and scoring — and a terrible format for building intuition. So let's load it into FiftyOne and actually look at it.
Why explore a benchmark in FiftyOne?
Reading a benchmark's metric table tells you a model's average displacement error to three decimals. It tells you nothing about what the benchmark contains, where the motion is hard, or whether this data even resembles your problem. FiftyOne closes that gap by turning the raw tracks into a browsable, filterable, scrubbable dataset. We've created a notebook that does a few concrete things:
Scrub the ground truth. Open any clip and the benchmark's tracked query points are attached per frame, so as you play the video the points move with the object. Watch the 8 points ride a black swan's neck as it swims, or follow a mountain biker through a jump. This single interaction makes the forecasting task concrete in a way no equation does — you see exactly what the model is asked to predict, and what "correct" looks like.
Filter by what you care about. The loader carries each clip's caption and its object category — pulled straight from the benchmark's object-keyed tracks (for example "racing_car") — into FiftyOne sidebar fields. Want every clip involving a particular animal or vehicle? Filter category. The benchmark's diversity — animals, cars, people, sports — stops being an abstraction and becomes something you can slice. There's also a split field marking each clip's source dataset; in this DAVIS-only setup it shows a single value, but it's what lets you separate DAVIS's third-person scenes from HOT3D's egocentric ones once those splits are loaded (more on that below).
Read the data's character, including its warts. Browsing the grid, you quickly notice the texture of real-world data: clean tracks on a clearly moving subject, and the occasional tracker drift near reflections or occlusions. PointMotionBench even ships per-point trust weights and keep-masks for exactly this reason, and seeing where tracking gets noisy is itself a lesson in why 3D motion annotation from unconstrained video is hard.
Stage the evaluation before you run it. Even before any model output exists, the notebook saves the views the workflow needs — a missing_predictions filter and a per-split view — so the structure is in place. The "worst predictions" sort is created automatically as soon as predictions are added (until then it simply isn't there, since there's nothing to rank). The moment you cache model outputs, surfacing the model's worst cases becomes one click away.
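The per-frame overlay described above comes down to a small conversion. The following is a minimal sketch, not the notebook's actual loader. It assumes a clip's 2D tracks arrive as a nested list indexed [frame][point] in pixel coordinates, with an optional keep-mask of the same shape; the real .npz keys and array layouts may differ. FiftyOne stores keypoints in relative [0, 1] coordinates and numbers video frames from 1, so the helper normalizes by frame size and shifts the index. The function names and the `gt_points` field name are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def tracks_to_frame_points(
    tracks_px: List[List[Point]],
    width: int,
    height: int,
    keep: Optional[List[List[bool]]] = None,
) -> Dict[int, List[Point]]:
    """Convert pixel-space tracks into {frame_number: [(x, y), ...]}.

    tracks_px[t][n] is the (x, y) pixel position of point n at frame t
    (assumed layout). Output coordinates are relative, in [0, 1], and
    frame numbers start at 1, matching FiftyOne's video conventions.
    """
    per_frame: Dict[int, List[Point]] = {}
    for t, points in enumerate(tracks_px):
        kept = []
        for n, (x, y) in enumerate(points):
            # Drop points the (hypothetical) keep-mask marks as unreliable.
            if keep is not None and not keep[t][n]:
                continue
            kept.append((x / width, y / height))
        per_frame[t + 1] = kept
    return per_frame


def attach_to_sample(sample, per_frame, field="gt_points"):
    """Write the points onto a FiftyOne video sample, one Keypoints per frame."""
    import fiftyone as fo  # imported lazily so the helper above has no dependency

    for frame_number, points in per_frame.items():
        keypoint = fo.Keypoint(label="query", points=points)
        sample.frames[frame_number][field] = fo.Keypoints(keypoints=[keypoint])
```

Once samples carry these fields, the filtering and staging steps map onto standard FiftyOne view calls: `dataset.match(F("category") == "racing_car")` (with `F` being `fiftyone.ViewField`) for a category slice, and `dataset.save_view("missing_predictions", view)` to store a view under a name.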
The payoff is a shift from trusting a benchmark to understanding it. You explore the task, the inputs, the diversity, and the difficulty directly — and you do it in the same tool you'll use to evaluate the model.
Got a GPU? Take it further and add predictions
What the notebook demonstrates as-is is the benchmark and the task, fully explored. What it doesn't yet show is the model doing anything — and that's the natural next step.
To get there, run MolmoMotion offline on the same clips. A companion script reads each PointMotionBench clip, feeds the first frame, caption, and query points to the model, and caches the predicted 3D trajectory as one .npz per clip. (MolmoMotion uses a Molmo 2 backbone — a multi-billion-parameter VLM — so this is GPU work, deliberately separated from the notebook so the exploration above stays fast and laptop-friendly.)
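The caching step could look roughly like the sketch below. It is not the companion script itself: `predict_fn` stands in for the actual MolmoMotion inference call, and the clip dictionary keys (`id`, `first_frame`, `caption`, `query_points`) and the `pred_tracks_3d` array name are assumptions made for illustration. The point is the pattern: one .npz per clip, skipped if it already exists, so an interrupted GPU run can resume where it stopped.

```python
from pathlib import Path
from typing import Callable, Iterable, List

import numpy as np


def cache_predictions(
    clips: Iterable[dict],
    predict_fn: Callable[..., np.ndarray],
    out_dir: str,
) -> List[Path]:
    """Run predict_fn once per clip and store each result as <clip id>.npz."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for clip in clips:
        path = out / f"{clip['id']}.npz"
        if path.exists():
            # Already cached by an earlier run; skip the expensive model call.
            continue
        # Placeholder for the real model call; expected to return an array
        # of shape (num_future_frames, num_points, 3).
        pred = predict_fn(clip["first_frame"], clip["caption"], clip["query_points"])
        np.savez_compressed(path, pred_tracks_3d=np.asarray(pred, dtype=np.float32))
        written.append(path)
    return written


def load_prediction(out_dir: str, clip_id: str) -> np.ndarray:
    """Read one cached trajectory back for the notebook to attach."""
    with np.load(Path(out_dir) / f"{clip_id}.npz") as data:
        return data["pred_tracks_3d"]
```

Keeping the model behind a plain callable means the notebook side only ever touches `load_prediction`, which needs NumPy and nothing else.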
Once those predictions exist, the demo comes alive:
Predicted tracks render right alongside ground truth, per frame, so you can scrub a clip and watch the model's forecast diverge from — or hug — the real motion.
Every clip gets a 3D Average Displacement Error, the benchmark's official metric, computed against its ground-truth track.
The "worst predictions" view appears, instantly surfacing the clips where the model struggles most — the single most valuable thing FiftyOne does for model evaluation. Instead of an average, you get the specific failures, with the video right there to explain why.
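As a rough illustration of the metric and the ranking, the sketch below computes a 3D Average Displacement Error as the mean Euclidean distance between predicted and ground-truth points over all future frames and points. That is the common definition of ADE; how PointMotionBench's official scoring applies its trust weights and keep-masks is not covered here, so the optional `keep` mask is an assumption, as are the function names and array shapes.

```python
from typing import Dict, List, Optional, Tuple

import numpy as np


def ade_3d(pred: np.ndarray, gt: np.ndarray, keep: Optional[np.ndarray] = None) -> float:
    """Mean Euclidean distance between pred and gt, both shaped (T, N, 3).

    keep is an optional boolean (T, N) mask; only True entries are averaged.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {gt.shape}")
    dist = np.linalg.norm(pred - gt, axis=-1)  # (T, N) per-point errors
    if keep is not None:
        dist = dist[np.asarray(keep, dtype=bool)]
    if dist.size == 0:
        return float("nan")
    return float(dist.mean())


def rank_worst(ade_by_clip: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort clips from highest to lowest error, i.e. worst predictions first."""
    return sorted(ade_by_clip.items(), key=lambda kv: kv[1], reverse=True)
```

With each clip's value stored in a sample field (say `ade_3d`, a hypothetical name), the equivalent ordering inside FiftyOne is `dataset.sort_by("ade_3d", reverse=True)`.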
That's the full arc: predicted vs. ground-truth motion, side by side, sortable by error, browsable by object and motion type. The benchmark exploration is the foundation; the prediction overlay is the model demo built on top of it.
A note on scope
The notebook is configured for the DAVIS split to start: the third-person, in-the-wild portion of PointMotionBench, and both the smallest split and the one with the fewest access restrictions. That's 90 clips, which is plenty to demonstrate the workflow. The other two splits, HOT3D (egocentric object manipulation, captured on Project Aria glasses) and WorldTrack (egocentric plus studio scenes), are where the robotics and multi-view stories live; they require their own dataset access and reconstruction steps, and slot into the same pipeline when you're ready.
Next steps
Browse the benchmark. Load the DAVIS clips, toggle the per-frame points, and filter by category and caption. Build intuition for the task before you touch the model.
Run inference on a GPU to produce cached predictions, then reload — the per-frame predicted tracks, ADE metrics, and worst-case views populate automatically.
Expand to HOT3D and WorldTrack to cover egocentric and studio motion once you have access to those sources.
Resources
MolmoMotion blog post (Ai2): https://allenai.org/blog/molmo-motion
The DAVIS sequences are licensed CC BY-NC 4.0 (non-commercial, with attribution); PointMotionBench is provided for research under Ai2's Responsible Use guidelines. Each underlying source dataset carries its own terms — review them before use.