What Nine CVPR 2026 Papers Reveal About Trust, Robots, and Human-Likeness

Growing up, my favorite place in the entire world was ShowBiz Pizza Place. For the uninitiated, ShowBiz is like Chuck E. Cheese, but infinitely better. Sitting at a long cafeteria table in a dark room and eating pizza while The Rock-afire Explosion, an animatronic band consisting of make-up-wearing forest animals (and, inexplicably, a gorilla), played catchy songs was my happy place.

I loved the drama, the anticipation, and most of all, the animatronics. I loved that I knew they weren’t real and I loved that I could pretend that they were real. But my love for robots extended beyond the concrete walls of our local strip mall and into the real world, from the Teddy Ruxpin doll I carried around until its stuffing fell out, to lowering my voice and saying “I’ll be back” in a terrible Austrian-German accent before I even knew what the T-800 was, to bestowing my first Roomba with the name Sebastian.

While these experiences are uniquely my own, mechanized robots have been a part of the cultural zeitgeist since Karel Čapek introduced the term “robot” in his play, Rossumovi Univerzální Roboti (Rossum's Universal Robots), in 1921. While Čapek’s robots were a vessel for human anxiety over mass industrialization (and the play ended with a literal robot uprising), we’ve seen robots span the spectrum of evil to good, from Ultron and the Borg Collective on one end, through some replicants and Marvin the Paranoid Android in something of a middle ground, to robots like WALL-E, the Iron Giant, and of course C-3PO and R2-D2 on the “I would adopt this robot and care for it like my own” end of the spectrum.

And with how fast physical AI is moving, maybe that dream of robot adoption isn’t so far off.

What animation taught me about human movement

But before I get into detailing my plans for my personal backyard animatronic pizza cafe, we need to detour into the world of animation, because animation taught me how to see. For the last six years and counting, I’ve studied and practiced animation alongside 3D rigging and modeling. At its core, animation relies on the 12 principles created by Ollie Johnston and Frank Thomas, Disney legends who animated classics like The Jungle Book and Lady and the Tramp.

The 12 principles were designed to act as guidelines that incorporated weight, anticipation, and physics to create more believable and engaging animation. The 12 principles have stood the test of time because they engage the brain as a collaborator: animators aren’t just drawing a frame-by-frame sequence of events, they’re manipulating timing, exaggeration, and appeal to communicate movement and intent. Timing and spacing, in particular, are critical because the story lives in the gaps.

Take the humble bouncing ball (the “hello world” of animation), where the timing is defined by the ball hitting the ground, and the spacing of the ball indicates how fast the ball is moving between bounces. In this illustration from The Animator’s Survival Kit, we see that at the apex of a bounce, the drawings of the ball overlap, which, when played in sequence, makes the ball appear to hover, exactly as we would expect it to in the real world. And by adding more space between the drawings as the ball ascends and descends, we let the brain fill in the gaps and interpret them as speed.

2D bouncing ball animation showing timing and spacing using transparent orange balls that overlap near the apex of the bounce and spreading out as the ball gets closer to the ground. — Standard 2D bouncing ball illustration showing timing and spacing
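The same spacing pattern falls out of basic physics. Here is a minimal sketch, assuming a ball under constant gravity sampled at equal time steps over one bounce. The function names, launch speed, and frame count are illustrative choices, not anything taken from the book.

```python
def bounce_heights(launch_speed=4.9, gravity=9.8, frames=11):
    """Height of the ball at equally spaced times across one bounce."""
    flight_time = 2 * launch_speed / gravity  # time from ground back to ground
    step = flight_time / (frames - 1)         # equal timing between frames
    return [launch_speed * (i * step) - 0.5 * gravity * (i * step) ** 2
            for i in range(frames)]


def frame_spacing(heights):
    """Distance the ball travels between consecutive frames."""
    return [abs(b - a) for a, b in zip(heights, heights[1:])]


heights = bounce_heights()
spacing = frame_spacing(heights)
for i, gap in enumerate(spacing):
    print(f"frame {i:2d} -> {i + 1:2d}: moved {gap:.3f} m")
```

The timing between frames never changes, but the printed gaps shrink toward the middle of the list (the apex) and widen toward the ground. That is the overlap-then-spread pattern the drawing encodes.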

Contrast this with rotoscoping, an animation technique where stylized illustrations are drawn over footage of human actors with the aim of creating realistic animation. In theory, rotoscoping should be the pinnacle of animation, because its aim is to accurately capture human movement and expression frame-by-frame. But fully rotoscoped films have never quite had the same appeal as traditionally animated films because rotoscoping (in addition to being expensive and time-consuming) over-indexes on the wrong things. It’s so faithful to human movement and expression that it flattens it.

Compare A Scanner Darkly, a rotoscoped sci-fi thriller starring Keanu Reeves, and Miyazaki’s Spirited Away. Despite being highly stylized, the world Miyazaki has created is more emotionally true than the one in A Scanner Darkly because Miyazaki engages with the 12 principles and exaggerates based on a deep understanding of movement, whereas rotoscoping marches steadily along without deviation or surprise. The completely fabricated world of Spirited Away feels more accessible to us than the world literally traced from reality in A Scanner Darkly.

So, where does this leave us in the world of physical AI?

Reading CVPR 2026 physical AI papers through a different lens

I had originally started this piece as a round-up of physical AI papers that had been accepted to the Conference on Computer Vision and Pattern Recognition (CVPR), thinking that I could approach the round-up by looking at the human-ness of robots as a trust factor and explore where in physical AI human-ness mattered, and where it didn’t. But the more I read, the more my question morphed from “where does human-ness matter in physical AI” to something weirder and more interesting.

I’ll be honest: I did not read every paper that has been accepted to CVPR 2026. There are over 4,000 of them, and even winnowing the pool down to those focused on physical AI, reading all of them before the conference would be an impossible task. So I selected those that initially drew my attention because they were exploring the human-ness of robots in some manner. If I’ve missed a paper, I want to hear about it. You can always find me in the Voxel51 Discord server.

But the papers I did read fell into two categories: robots learning to be readable to the humans around them, and robots being built around incompleteness rather than in spite of it.

Robots learning to be readable

The first set of papers asks what it takes for a robot to function alongside humans: not to pass as one, but to be legible to one. In motion, in social behavior, in emotional responsiveness. In Blade Runner, the Voight-Kampff test wasn't about catching robots doing bad things. It was about finding the ones you couldn't read as replicants. These papers are working on making robots readable to humans, not to replace them, but to move through the same world as them.

Diagram illustrating the Robot Motion Turing Test. A human performing a kicking motion and a humanoid robot performing the same motion are both converted to skeletal SMPL-X representations. A human evaluator is shown the skeletons and asked 'Is this a human motion?' The human receives the answer Yes, the humanoid receives No. — Figure 1. From Towards Motion Turing Test: Evaluators judge whether the pose sequence resembles human motion, focusing solely on motion without appearance cues.

The most literal version of that test comes from Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots, which asks a deceptively simple question: can human observers distinguish a humanoid robot's movement from a human's when appearance is removed entirely?

The researchers built HHMotion, a dataset of 1,000 motion clips from 11 humanoid robots and 10 human performers across 15 action categories. Each clip was stripped of all visual cues using a body model called SMPL-X, and 30 annotators scored it on a 0–5 scale of human-likeness, from 'completely inhuman-like' to 'indistinguishable from human motion.' The results are telling. Robots come closest to fooling us on walking, but the gap is still 1.31 points on that scale: annotators rate human walks around 3.92 ('high similarity') and robot walks around 2.61 ('moderate likeness').

It’s not that the robots are fooling us, it’s that they're losing by less. And on jumping, boxing, running, and pingpong, they fail badly. Walking is a cycle, one of the first things animators learn to draw, because the rhythm is predictable and the beats are known. Boxing is the opposite. It's loaded with individual history, physical biography, the anticipation and follow-through of a specific body that has learned to move in a specific way. You can learn the pattern of a walk. You cannot fake a biography.

What's interesting is that RoboPerform, the first framework to drive humanoid locomotion directly from audio without an intermediate motion generation step, and detailed in Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control, isn't trying to fake a biography either.

Instead of asking how to make a robot move like a human, it asks a different question. What if movement has two components, content and style, and style can be driven by something external? This is something animators have known forever. A character walking is content. A tired walk, a sneaky walk, a happy walk, that's style, and it's where the personality lives. RoboPerform uses audio such as music and speech as the style signal, conditioning humanoid locomotion on sound rather than captured human motion. The result is a robot that can dance and move expressively without ever tracing a human performer. It's closer in spirit to the Rock-afire Explosion than to a motion capture session. The movement isn't human, but it has rhythm, personality, and presence. It engages you without pretending to be something it isn't.

Grid of navigation trajectory comparisons across three outdoor environments: a street crossing, a park, and a campus. For each environment, a satellite map shows the planned route in green versus a collision-prone route in red, followed by a series of first-person video frames comparing SocialNav trajectories against CityWalker, with warning symbols marking unsafe navigation events in the CityWalker rows. — Figure 4. From SocialNav. Qualitative comparison on the SocNav Benchmark. We visualize representative trajectories in three scenes (Crossing, Park, Campus). The left column shows top-down path views with our method (green) and the CityWalker baseline (red), where warning signs mark unsafe or socially improper behaviors. The right columns depict corresponding egocentric views: SocialNav remains on sidewalks and walkways, while the baseline often takes shorter but socially risky routes through restricted regions (such as driveways, dry streambeds, lawns, and green belts) or crashes into obstacles like glass walls and trees.

SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation extends the question of readability into social space. Navigation efficiency and social compliance are in tension, because a robot that optimizes purely for getting from A to B will cut across personal space, fail to yield, and move in ways that feel aggressive or unpredictable to the humans around it.

The paper builds a two-component system: a vision-language model that interprets social norms and context, and an action expert that generates physically plausible, socially compliant trajectories rewarded for smoothness and adherence to expert paths. The trajectories matter as much as the decisions, because human movement follows arcs, not straight lines, and that's one of the 12 principles for a reason. We telegraph turns. We slow before we redirect. We give the people around us time to read our intentions, which is anticipation, another of the 12 principles. A robot that pivots sharply has no anticipation and reads as threatening. A robot that moves smoothly and predictably reads as considerate. SocialNav is building the latter, and it's doing it in outdoor urban environments like parks, streets, and campuses, where the social stakes are highest and the rules are least explicit.
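One way to make "a robot that pivots sharply" concrete is to score a path by how abruptly its heading changes. The sketch below is an illustrative assumption, not SocialNav's reward function. The function name, the waypoints, and the choice of summed squared turning angle as the metric are all hypothetical.

```python
import math


def turning_cost(path):
    """Sum of squared heading changes (radians) along a list of (x, y) waypoints."""
    headings = [math.atan2(y1 - y0, x1 - x0)
                for (x0, y0), (x1, y1) in zip(path, path[1:])]
    cost = 0.0
    for h0, h1 in zip(headings, headings[1:]):
        # Wrap the difference into [-pi, pi] so a small turn never looks like a big one.
        delta = math.atan2(math.sin(h1 - h0), math.cos(h1 - h0))
        cost += delta ** 2
    return cost


# Hypothetical waypoints: both paths start at (0, 0) and end at (4, 4).
sharp_pivot = [(0, 0), (2, 0), (4, 0), (4, 2), (4, 4)]  # one 90-degree turn
# Quarter circle of radius 4 centred at (0, 4), sampled every 15 degrees.
arc = [(4 * math.sin(math.radians(a)), 4 - 4 * math.cos(math.radians(a)))
       for a in range(0, 91, 15)]

print(f"sharp pivot cost: {turning_cost(sharp_pivot):.3f}")
print(f"arc cost:         {turning_cost(arc):.3f}")
```

The arc spreads its change of direction over several small turns, so it scores lower than the single pivot. It is a crude numerical stand-in for telegraphing a turn.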

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving takes the question somewhere more intimate: the inside of your car. It's a vision-language-action model for autonomous driving that reads not just what a passenger says, but the emotional register they say it in. "Stop here" and "stop here NOW" are semantically similar but emotionally different, and E3AD is built to hear that difference using a model that captures valence, arousal, and dominance from natural language.

The car isn't just executing commands, it's reading the room. Which raises a question worth sitting with: if a car responds to your anxiety by driving more carefully, does it matter whether it understands your anxiety or has simply learned that this pattern of words requires a certain response? From inside the car, you probably can't tell. And when it works, it feels like being heard.

These four papers are asking what it looks like for a robot to show up legibly in human space. But the second group of papers is asking something different, and I'd argue harder: what does it look like for a robot to know what it doesn't know? To recognize the edges of its own understanding, and instead of papering over them, invite a human in?

Robots built around incompleteness

These are the robots being built around incompleteness. And they're the ones that surprised me most.

The most direct version of that argument comes from When Robots Should Say “I Don’t Know”: Benchmarking Abstention in Embodied Question Answering. AbstainEQA starts from a surprisingly honest observation: 32.4% of real human queries to embodied agents are genuinely ambiguous or unanswerable given what the agent can perceive. The object isn't there. The information isn't visible. The task is impossible from the agent's current position. And yet the best models only manage appropriate abstention 42.79% of the time, compared to 91.17% for humans.

The paper identifies five categories of legitimate 'I don't know': actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. What strikes me about this is that instead of being a paper focused on failure, it's a paper about honesty. A robot that knows when to stop and say 'I can't answer that' is a robot you can learn to trust, because you know that when it does answer, it means it. But what happens when the failure mode goes in the other direction? A robot that abstains on everything is appropriately humble and also functionally useless, and the 91.17% human baseline assumes humans don't over-abstain, which they very much do.
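The tension between useful abstention and over-abstention shows up as soon as you track two numbers instead of one. The sketch below uses invented toy labels and hypothetical function names. It is not AbstainEQA's evaluation code or data.

```python
def abstention_rates(should_abstain, did_abstain):
    """Return (abstention recall, over-abstention rate) for paired boolean lists."""
    unanswerable = [d for s, d in zip(should_abstain, did_abstain) if s]
    answerable = [d for s, d in zip(should_abstain, did_abstain) if not s]
    recall = sum(unanswerable) / len(unanswerable)       # abstained when it should
    over_abstention = sum(answerable) / len(answerable)  # abstained when it should not
    return recall, over_abstention


# Toy ground truth (made up for illustration): 3 of 8 queries cannot be answered.
should_abstain = [True, True, True, False, False, False, False, False]

always_answer = [False] * 8
always_abstain = [True] * 8
selective = [True, True, False, False, False, False, False, True]

for name, policy in [("always answer", always_answer),
                     ("always abstain", always_abstain),
                     ("selective", selective)]:
    recall, over = abstention_rates(should_abstain, policy)
    print(f"{name:15s} recall={recall:.2f} over-abstention={over:.2f}")
```

A policy that abstains on everything gets perfect recall on the unanswerable queries and also refuses every answerable one. That is why a single abstention score cannot separate honesty from uselessness.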

Figure showing the Mistake Attribution task applied to an egocentric video of someone attempting to pick up a hammer. On the left, a sequence of video frames with a probability bar chart highlighting the Point of No Return at Frame 17. On the right, three attribution outputs: semantic attribution identifying the wrong object picked up, temporal attribution pinpointing Frame 17 as the Point of No Return, and spatial attribution showing a bounding box locating where in the frame the mistake occurred. — Figure 1. From Mistake Attribution. Mistake Attribution (MATT) task aims to understand the deviation between a human attempt (video) and the instruction (text) along three axes. Semantic attribution identifies what semantic role in the instruction is violated (e.g., a wrong Object “bolt” is mistakenly picked up instead of “hammer”); temporal attribution identifies when the attempt reaches the point of no return (PNR) (e.g., Frame 17); and spatial attribution identifies where, in the PNR frame, the mistake is manifested (e.g., the red bounding box).

Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos* takes a similar honesty and applies it to errors. Current systems can tell you a task failed. MATT wants to tell you what failed, when it became unrecoverable, and exactly where in the scene it went wrong. The researchers call that moment the Point of No Return: the frame past which correction is impossible. You can recover from chopping the vegetables unevenly, but you can't unburn the eggs. MATT builds a data engine called MisEngine that automatically constructs mistake datasets at scale, yielding two new benchmarks, EPIC-KITCHENS-M and Ego4D-M, that are the largest by an order of magnitude in the rapidly evolving mistake recognition literature, along with a unified model, MisFormer, that addresses all three attribution dimensions simultaneously. A robot that can locate its own Point of No Return is a robot that understands something true about the structure of mistakes. That's not a small thing.

* this paper is from the lab of Dr. Jason Corso, who is both the co-founder and current CSO of Voxel51.

VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation takes a different angle on incompleteness, treating it as the structural reality of any real-world robot. The paper trains a privileged teacher policy in simulation with full access to ground-truth state, then distills a student policy that sees only RGB images and proprioception from the real robot. The teacher gets ground truth because it can. The student never will, because the real Unitree G1 won't have it either. So instead of trying to recover the missing information, VIRAL closes the sim-to-real gap by training the student as if the world will definitely not look like the lab, using large-scale visual domain randomization over lighting, materials, camera parameters, and sensor delays, scaled across up to 64 GPUs. The result is a robot that completes 54 out of 59 consecutive loco-manipulation cycles without real-world fine-tuning, approaching expert teleoperation performance. The system works not because incompleteness is a feature, but because it is the operating condition, and VIRAL is designed around that fact rather than against it. The five failures across 59 cycles are the more interesting question. The 54 successes show the recipe works in expectation. What goes wrong in the five tells you where the recipe's edges are, and the paper would be stronger for naming them.
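Visual domain randomization amounts to redrawing the nuisance parameters of the scene for every training episode. The sketch below is a generic illustration, not VIRAL's code. The class, field names, and numeric ranges are hypothetical placeholders, picked only to mirror the four categories listed above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualRandomization:
    light_intensity: float   # relative brightness multiplier
    material_id: int         # index into a hypothetical texture library
    camera_fov_deg: float    # camera field of view in degrees
    sensor_delay_ms: float   # lag between image capture and policy input


def sample_randomization(rng, num_materials=50):
    """Draw one set of visual parameters for a training episode (ranges are made up)."""
    return VisualRandomization(
        light_intensity=rng.uniform(0.3, 2.0),
        material_id=rng.randrange(num_materials),
        camera_fov_deg=rng.uniform(60.0, 90.0),
        sensor_delay_ms=rng.uniform(0.0, 100.0),
    )


rng = random.Random(0)  # seeded so a run can be reproduced
episodes = [sample_randomization(rng) for _ in range(3)]
for episode in episodes:
    print(episode)
```

Because the student never sees the same lighting, surface, camera, or latency twice, it cannot lean on any one of them. The only stable signal left is the task itself.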

Diagram of the ManualVLA architecture showing two parallel pathways: a Planning Expert that takes current and goal images as input and generates a multimodal manual with text, position, and subgoal image outputs, and an Action Expert that takes robot state and noised action inputs to produce a sequence of actions. A cross-task shared attention mechanism connects the two pathways. A cross-task attention mask and three-stage training schedule are shown alongside. — Figure 2. From ManualVLA. Framework of ManualVLA. (a) To accomplish long-horizon tasks with defined goal states, we propose ManualVLA, a unified VLA model built upon a MoT architecture. The framework consists of two experts: a planning expert responsible for generating multimodal manuals, and an action expert responsible for predicting precise actions. The planning expert processes human instructions, the current image, and the final goal image to generate intermediate manuals that combine next-step image, positions, and sub-task instructions. We introduce an explicit CoT reasoning process, where each positional indicator serves as a visual prompt embedded into the observation of the action expert. (b) Along with the cross-task shared attention mechanism and the designed attention mask, the generated manual tokens are also used as conditioning signals for action generation, enabling an implicit CoT reasoning process that effectively guides the action expert. (c) ManualVLA adopts a three-stage training strategy that aligns the planning and action experts for effective collaboration.

ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation is the paper that most directly maps onto how humans actually learn to do complex things, and also onto how Frank and Ollie taught animators to plan complex action. Given a final goal state of a completed LEGO structure or object rearrangement, ManualVLA doesn't map directly from perception to action. Instead it generates an intermediate procedural manual, a sequence of subgoals combining predicted images of the next scene state, pixel-level position coordinates, and natural language instructions, which then feed an action expert. This is pose-to-pose animation, where the animator plans the key poses first and fills in the inbetweens later, and it's used for exactly the same reason: complex action gets unreliable when you have to figure it out frame by frame. The manual is also legible, meaning a human supervisor can read it and understand what the robot is planning before it acts. That legibility is a way of admitting incompleteness: it's the robot saying 'here's what I think the steps are, check my work.' ManualVLA achieves a 32% higher success rate than the previous hierarchical SOTA on LEGO assembly and object rearrangement, which suggests that the detour through legible planning isn't just philosophically interesting. It works.
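The manual described above can be pictured as a plain data structure that a supervisor reads before the action expert runs. This is a hypothetical sketch. The class and field names, file names, and example steps are invented for illustration and are not ManualVLA's actual interface.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ManualStep:
    instruction: str               # natural-language sub-task
    target_pixel: Tuple[int, int]  # (x, y) position in the current image
    subgoal_image: str             # reference to the predicted next-state image


def render_manual(steps):
    """Format the manual as numbered lines a human can review."""
    return [f"{i}. {s.instruction} at pixel {s.target_pixel} (preview: {s.subgoal_image})"
            for i, s in enumerate(steps, start=1)]


# Invented example steps for a two-brick assembly.
manual = [
    ManualStep("Place the red brick on the base plate", (212, 340), "subgoal_01.png"),
    ManualStep("Stack the blue brick on the red brick", (212, 310), "subgoal_02.png"),
]

for line in render_manual(manual):
    print(line)
```

The point of the structure is that every step is inspectable on its own, like a key pose on an animator's exposure sheet, so a wrong plan can be caught before any motion happens.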

Dexterous World Models closes the section with something that initially looks like pure capability research. The paper builds a scene-action-conditioned video diffusion model, where given a static 3D scene and an egocentric hand motion sequence, it predicts the dynamic changes the hand will cause before they happen. But what the paper is really doing is insisting that the hand and the world are a system. Animators call this follow-through and overlapping action. When a character pushes a cup, the cup doesn't just translate. The liquid sloshes, the surface registers contact, the character's arm decelerates against the resistance. The world reacts to the action and the action reacts to the world. DWM is doing this in reverse. It's predicting the reaction before the action commits. A robot that can simulate what its own actions would look like in a given scene before committing to them has a more honest relationship to its own uncertainty. It's not assuming the world will cooperate. It's modeling the ways it might not.

The perfect robot is an imperfect robot

When I started this piece I thought I was writing about how close we're getting to human-like robots, and we’re closer than I expected by some measures and farther than I expected by others. But the papers I kept coming back to weren't the ones closing the gap, they were the ones naming it and exploring it. They were the ones acknowledging that my motion isn't your motion, that this query is answerable and that one isn’t, that I burned the eggs at frame 17 and have to start over, and yes, I’m operating with less information than the teacher I was trained from.

Go back to the bouncing ball. The ball hovers at the apex of its arc because the animator drew it overlapping itself, not because anything about the physics demanded it. The closest you can get to a faithful copy of reality is rotoscoping, and rotoscoping is dead on arrival. The story lives in the gap between what was traced and what was added. The perfect imitation is the worst version. The deliberate exaggeration is the one your brain agrees to engage with.

When I look over at my dog and cats on the couch, I think about three unique species living comfortably with one another. None of us is pretending to be the others. We're legible to each other and we know where our gaps are. That's the model the more interesting papers at CVPR 2026 are reaching for. That’s the model I'd want any robot in my home to be built around. I don’t want a robot that hides what it doesn't know. Instead, give me a robot that hovers, deliberately, at the apex of its own uncertainty, and lets me read the gap.

What did I miss? Meet me in the Voxel51 Discord server and let's talk robots.
