Visual Agents at CVPR 2025

May 28, 2025 – Written by Harpreet Sahota


AI Systems that See, Understand, and Act


Visual Agents represent a significant advancement in Visual AI and agentic AI.

These AI agents enable systems to perceive, understand, and interact with visual interfaces like humans do. As specialized Visual AI systems built upon Vision Language Models, they’re fundamentally changing how we approach the automation of visual interface interactions.

This year’s CVPR 2025 conference features some exciting research on Visual agents; these papers collectively signal that Visual Agents have moved from theoretical possibility to practical reality, representing one of the most exciting developments in applied AI.

Why This Research Wave Matters Now for Agentic AI

Recent advancements in foundational vision language models have finally provided the perceptual capabilities to tackle the long-standing challenge of GUI automation.

This timing is critical: these capabilities align with the growing need for AI systems that can navigate an increasingly complex digital world on our behalf. What makes this CVPR particularly exciting is the complementary nature of the accepted research. Different teams have simultaneously tackled distinct aspects of the visual agent challenge:

  • Novel architectures specifically designed for the interleaved nature of vision-language-action sequences
  • Techniques for efficiently processing high-resolution screenshots without losing critical details
  • Specialized methods for precise element grounding that enable reliable interaction
  • Approaches for managing interaction histories across multiple observation-action cycles
  • Systems demonstrating cross-platform compatibility from web to mobile interfaces

From Research to Practical Breakthroughs

Visual Agents are moving from academic curiosity to practical technology.

Digital interfaces are part of nearly every aspect of work and life, and the ability to automate interactions with them is becoming increasingly valuable. The timing of these breakthroughs couldn’t be more perfect. Early work in this area struggled with fundamental limitations—imprecise element localization, brittle performance across different interfaces, and limited action capabilities.

The progression toward generalist agents capable of working across diverse environments represents a crucial evolutionary step, potentially leading to systems that can handle various visual interface tasks with human-like adaptability. Meanwhile, the emergence of collaborative AI systems indicates a future where visual agents coordinate with other specialized AI components to tackle complex workflows.

The Visual Agent papers from CVPR I’m most excited about are:

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration
SpiritSight Agent: Advanced GUI Agent with One Look
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

As we explore these papers in greater detail, we’ll see how they collectively represent incremental progress and a fundamental shift in what’s possible at the intersection of computer vision, natural language processing, and interactive systems. Most significantly, these advancements are happening just as the need for such technology is exploding.

Before diving into the research in these papers, let’s clearly understand what Visual Agents are and the capabilities that position them at the forefront of agentic AI. If you’re wondering what agentic AI looks like in practice, these examples provide a compelling answer.

How are Visual Agents Different from Vision Language Models?

A generalist Vision Language Model (VLM), or Multimodal Large Language Model (MLLM), is a foundation model primarily trained on vast quantities of paired text and image data.

These models excel at tasks like understanding images, answering questions about visual content, generating captions, or processing text found within images. Their strength lies in understanding and reasoning about visual and textual modalities.

On the other hand, a Visual Agent is a more specialized type of AI agent, typically built upon or adapted from a VLM/MLLM. Its specific goal is to perceive visual environments (like Graphical User Interfaces, or GUIs) and act within them. These are often referred to as Vision-Language-Action (VLA) models. While they leverage the base VLM’s visual and language understanding capabilities, their core function is interacting with and controlling an environment based on visual observation and natural language instructions.

The key differences and reasons why a general VLM alone is insufficient for GUI agent tasks lie in the output modality, required specialized capabilities, and the nature of the task itself.

Output Modality: Actions vs. Text

General VLMs are primarily designed to output text. Visual Agents, however, must produce executable actions that manipulate the GUI, such as clicking, typing, or scrolling. This fundamental difference in output type means a general VLM, without significant adaptation, cannot directly control a GUI environment.

While general VLMs have visual understanding, GUI tasks demand specific skills that they typically lack or perform poorly on.

Element Grounding

A critical capability for a Visual Agent is accurately identifying and locating specific interactive elements (like buttons or input fields) within a GUI screenshot. Even the most powerful, general vision language models significantly struggle with element grounding from visual input alone. This is a major limitation because an agent cannot interact with elements it cannot reliably find. SpiritSight explicitly notes that the primary challenge in learning GUI navigation is learning the positional sub-policy needed for accurate grounding.
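To make this concrete, grounding quality is commonly scored by checking whether the agent’s predicted click point lands inside the target element’s ground-truth bounding box. The sketch below is a minimal, illustrative version of that check; the coordinates and function names are my own, not taken from any particular benchmark.

```python
# Minimal grounding check: a predicted click point counts as correct only if
# it falls inside the ground-truth bounding box of the target element.
def point_in_bbox(point, bbox):
    """point: (x, y); bbox: (x_min, y_min, x_max, y_max)."""
    x, y = point
    x_min, y_min, x_max, y_max = bbox
    return x_min <= x <= x_max and y_min <= y <= y_max

def grounding_accuracy(predicted_points, target_boxes):
    """Fraction of predictions that land inside their target element."""
    hits = sum(point_in_bbox(p, b) for p, b in zip(predicted_points, target_boxes))
    return hits / len(predicted_points)

# Two predictions: the first hits its button, the second misses.
preds = [(483, 217), (120, 40)]
boxes = [(450, 200, 520, 240), (300, 300, 360, 340)]
print(grounding_accuracy(preds, boxes))  # 0.5
```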

Processing High-Resolution GUI Inputs Efficiently

GUI screenshots are often high-resolution (e.g., 2K). Processing these high-resolution images results in very long token sequences, which is computationally expensive for models not specifically optimized for this. General VLMs may not have mechanisms like UI-Guided Visual Token Selection or Universal Block Parsing to efficiently handle UI visuals’ redundancy and structured nature while preserving necessary detail for grounding. This can lead to inefficiencies and high computational costs.
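A quick back-of-envelope calculation shows why this matters. Assuming a ViT-style encoder that spends one visual token per 28×28-pixel region (roughly what Qwen2-VL-family encoders use after patch merging), a single 2K screenshot already costs thousands of tokens; the numbers below are purely illustrative.

```python
# Rough visual-token budget for one 2K screenshot under an assumed
# 28x28-pixels-per-token encoder. Other VLMs use different patch sizes.
width, height = 2560, 1440      # a typical 2K GUI screenshot
pixels_per_token = 28           # assumed pixels per visual token (per side)

tokens = (width // pixels_per_token) * (height // pixels_per_token)
print(tokens)                   # 91 * 51 = 4641 visual tokens for one frame

# ShowUI reports trimming roughly a third of these by merging redundant
# regions (blank margins, uniform backgrounds) via UI-guided token selection.
print(round(tokens * (1 - 0.33)))  # ~3109 tokens after selection
```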

Managing Interleaved Vision-Language-Action History

GUI tasks often involve multi-step interactions where the agent needs to understand the context of previous observations (screenshots), user instructions, and actions taken. General VLMs are not inherently structured to effectively manage and reason over this complex, interleaved history of different modalities in a sequence. Techniques like Interleaved Vision-Language-Action Streaming are needed for this.
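As a rough sketch of what “interleaved” means in practice, the snippet below models an episode as alternating screenshots and structured actions that get flattened into a single message sequence for the model. The field names and message format are illustrative assumptions, not the exact schema used in any of these papers.

```python
# Illustrative data structure for an interleaved vision-language-action
# history: one instruction, then alternating observations and actions.
from dataclasses import dataclass, field

@dataclass
class Step:
    screenshot: str        # path or ID of the observation image
    action: dict           # structured action, e.g. {"type": "CLICK", "x": 483, "y": 217}

@dataclass
class Episode:
    instruction: str       # the user's natural-language goal
    steps: list[Step] = field(default_factory=list)

    def to_messages(self) -> list[dict]:
        """Flatten the episode into an interleaved message list for a VLM."""
        messages = [{"role": "user", "content": self.instruction}]
        for step in self.steps:
            messages.append({"role": "user", "content": {"image": step.screenshot}})
            messages.append({"role": "assistant", "content": str(step.action)})
        return messages

episode = Episode("Open the settings page and enable dark mode")
episode.steps.append(Step("step_0.png", {"type": "CLICK", "x": 483, "y": 217}))
print(len(episode.to_messages()))  # 3: instruction, screenshot, action
```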

The Action-Perception Gap

The fundamental distinction between general Vision Language Models and Visual Agents lies in their core design purpose: understanding versus acting. This difference shapes everything from their architecture to their training objectives.

General VLMs excel at passive analysis — describing what they see, answering questions about visual content, or reasoning based on visual input. Their output is primarily textual interpretation. While impressively capable at understanding visual scenes, they operate fundamentally as sophisticated perception systems, not interactive agents.

Visual Agents, by contrast, are built for the dynamic loop of perception, decision, and action. They must understand what they’re seeing and use that understanding to make consequential decisions that change their operating environment. This requires a fundamentally different architecture optimized for:

  • Processing visual feedback resulting from their own actions
  • Maintaining state across multiple interaction steps
  • Generating precise, executable commands rather than descriptive text

The Action Space Challenge

Perhaps the most significant limitation preventing general VLMs from functioning as effective agents is their inability to generate structured, executable actions. Visual Agents require specialized output capabilities that translate understanding into precise commands:

CLICK(x=483, y=217)
TYPE("search query")
SCROLL_DOWN(amount=0.5)

These are structured commands with exact parameters that control interfaces. The action space varies significantly across platforms — web interfaces offer different interaction possibilities than mobile apps — requiring Visual Agents to adapt to diverse control paradigms.
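To illustrate what “structured, executable” means in practice, here is a small parser that turns command strings like the ones above into dictionaries an execution layer could dispatch on. The grammar is illustrative; many agents emit JSON actions directly (ShowUI, for instance, standardizes actions as JSON).

```python
# Parse action strings like the examples above into structured commands.
# The grammar here is illustrative; many agents emit JSON actions directly.
import re

ACTION_PATTERN = re.compile(r"^(?P<name>[A-Z_]+)\((?P<args>.*)\)$")

def parse_action(text: str) -> dict:
    match = ACTION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Unparseable action: {text!r}")
    args = {}
    for part in filter(None, (p.strip() for p in match.group("args").split(","))):
        if "=" in part:
            key, value = part.split("=", 1)
            args[key.strip()] = value.strip().strip('"')
        else:
            args["value"] = part.strip('"')  # positional argument, e.g. TYPE("...")
    return {"type": match.group("name"), "args": args}

print(parse_action('CLICK(x=483, y=217)'))
# {'type': 'CLICK', 'args': {'x': '483', 'y': '217'}}
print(parse_action('TYPE("search query")'))
# {'type': 'TYPE', 'args': {'value': 'search query'}}
```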

General VLMs lack both the training to generate such structured outputs and the architectural components to precisely locate interactive elements. Without specialized training on interaction trajectories showing the relationship between observations and resulting actions, these models cannot develop the procedural understanding necessary for sequential decision-making.

The Missing Embodiment

At their core, general VLMs lack what we might call “embodied experience” — the fundamental understanding of how actions affect environments and how to leverage those effects to accomplish goals. This gap can’t be addressed through simple adaptations but requires specialized training regimes with interactive data, architectural modifications to support action generation, and mechanisms for maintaining context across interaction sequences.

The recent research on Visual Agents at CVPR 2025 represents a significant leap forward: these papers bridge the fundamental action-perception gap that has long separated powerful understanding models from truly interactive AI systems.

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

This paper introduces the Generalist Embodied Agent (GEA), showcasing how Multimodal Large Language Models can be transformed into versatile agents capable of handling diverse real-world tasks. GEA is a huge advancement in creating Visual Agent systems that seamlessly operate across embodied AI, games, UI control, and planning domains.

Figure 2 from the paper


This paper is interesting because it demonstrates how to create Visual Agents capable of real-world tasks, from object manipulation to game playing.

The GEA model is built on LLaVA-OneVision, which the authors picked for its ability to handle long-context interactions. The model uses a novel multi-embodiment action tokenizer that unifies diverse action types and employs a two-stage training process combining supervised learning and reinforcement learning:

  1. Unified Model Architecture: GEA adapts a pretrained MLLM to process environmental context and predict appropriate actions across various domains.
  2. Novel Action Tokenizer: A sophisticated tokenizer based on Residual VQ-VAE enables the model to handle discrete and continuous actions uniformly (a minimal sketch of the idea follows this list).
  3. Two-Stage Training: The model undergoes supervised fine-tuning followed by reinforcement learning, using a massive dataset of 2.2 million trajectories compiled from diverse sources, including human demonstrations, learned policies, and motion planners.
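The tokenizer in point 2 is worth a closer look. Residual vector quantization encodes a continuous action vector through a sequence of codebooks, each quantizing the residual left by the previous one, so a single action becomes a short sequence of discrete tokens. The sketch below uses random codebooks purely for illustration; in GEA the codebooks are learned as part of a Residual VQ-VAE.

```python
# Minimal residual vector quantization of a continuous action vector.
# Codebooks are random here purely for illustration; in the paper they are
# learned jointly with an encoder/decoder (Residual VQ-VAE).
import numpy as np

rng = np.random.default_rng(0)
num_codebooks, codebook_size, action_dim = 3, 256, 7   # e.g., a 7-DoF arm action
codebooks = rng.normal(size=(num_codebooks, codebook_size, action_dim))

def tokenize(action: np.ndarray) -> list[int]:
    """Quantize `action` into one discrete token per codebook stage."""
    residual, tokens = action.copy(), []
    for codebook in codebooks:
        index = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(index)
        residual = residual - codebook[index]   # next stage encodes what's left
    return tokens

def detokenize(tokens: list[int]) -> np.ndarray:
    """Reconstruct an approximate continuous action from its tokens."""
    return sum(codebooks[stage][token] for stage, token in enumerate(tokens))

action = rng.normal(size=action_dim)
tokens = tokenize(action)
print(tokens, float(np.linalg.norm(action - detokenize(tokens))))
```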

Impressive Results

GEA demonstrates strong cross-domain generalization, achieving competitive or state-of-the-art results across diverse benchmarks:

  • Manipulation: Reaches 90% success rate in CALVIN (10% higher than comparable methods), outperforms baselines in Meta-World and Habitat Pick, though struggles with Maniskill’s challenging camera angles
  • Gaming: Achieves 44% of expert scores in Procgen (outperforming specialist models) and surpasses Gato in Atari
  • Navigation: Matches Gato in BabyAI despite using only visual inputs and fewer demonstrations
  • UI Control: Outperforms GPT-4o with Set-of-Mark prompting on AndroidControl
  • Planning: Nearly matches specialist RL systems on LangR tasks

Performance gains stem from cross-domain SFT training and targeted RL fine-tuning. Further improvements in Maniskill, Atari, and AndroidControl could be achieved by extending RL training to these domains.

Key Lessons for Practitioners

This paper gives us key insights and lessons learned specifically for fine-tuning a general VLM to become a visual agent.

To adapt a general VLM into a capable visual agent for diverse embodied tasks, the authors point towards leveraging a strong pretrained MLLM base, learning a flexible action representation, and training with a combination of large-scale supervised data from multiple domains and subsequent online reinforcement learning. The key lessons are as follows:

  1. Start Strong: A pretrained multimodal language model provides substantial advantages, particularly in visual tasks. In the paper, they used LLaVA-OneVision. However, given the fast pace of model progress, I’d be interested in seeing how Qwen2.5-VL does.
  2. Unify Actions: A sophisticated action tokenizer is essential for handling discrete and continuous actions across different domains.
  3. Two-Stage Training Works Best: Supervised finetuning followed by reinforcement learning creates the most capable agents (a high-level sketch of this recipe follows this list).
    • Stage 1: Supervised Finetuning (SFT): First, fine-tune a pretrained VLM with supervised learning on a large collection of embodied experiences (demonstrations) from diverse domains. This stage adapts the VLM for embodied decision-making and produces the GEA-Base model.
    • Stage 2: Online Reinforcement Learning (RL): Follow up the SFT with online RL training in interactive simulators for a subset of domains. This stage learns from the agent’s own interactions, and techniques like LoRA keep the fine-tuning efficient.
  4. Diversity Matters: Training with cross-domain data (2.2M+ trajectories) significantly boosts performance compared to single-domain training.
  5. Reinforcement Learning is Crucial: Online RL is essential for developing robust agents that can recover from mistakes.
  6. Build on a Strong Foundation: RL is most effective after initial supervised training, rather than from scratch.
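Put together, the recipe in lesson 3 looks roughly like the sketch below, assuming a Hugging Face-style VLM checkpoint and PEFT LoRA adapters for the RL stage. The dataset and simulator helpers (`load_trajectory_dataset`, `make_simulator`, `collect_rollout`, `update_policy`) are hypothetical placeholders, not code from the paper.

```python
# High-level sketch of the two-stage recipe: cross-domain SFT, then online RL
# with LoRA adapters. Helpers marked "hypothetical" do not exist in any
# library; they stand in for the paper's data and simulator plumbing.
from transformers import AutoModelForVision2Seq, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForVision2Seq.from_pretrained(
    "llava-hf/llava-onevision-qwen2-7b-ov-hf"   # LLaVA-OneVision base (checkpoint name assumed)
)

# Stage 1: supervised fine-tuning on cross-domain trajectories -> GEA-Base.
sft_dataset = load_trajectory_dataset()          # hypothetical: ~2.2M trajectories
Trainer(
    model=model,
    args=TrainingArguments(output_dir="gea-base", num_train_epochs=1),
    train_dataset=sft_dataset,
).train()

# Stage 2: online RL in interactive simulators, updating only LoRA adapters.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"))
simulator = make_simulator()                     # hypothetical interactive environment
for _ in range(1000):                            # PPO-style loop, details omitted
    rollout = collect_rollout(model, simulator)  # hypothetical
    update_policy(model, rollout)                # hypothetical
```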


ShowUI: Advanced Vision-Language-Action for GUI Interactions

ShowUI introduces a vision-language-action (VLA) model specifically designed for GUI visual agents that operate in the digital world. Unlike traditional GUI automation methods that rely on metadata like HTML, ShowUI takes a more human-like approach by focusing on visual perception and interaction.

The research addresses key challenges:

  • Expensive visual modeling of high-resolution screenshots
  • Managing complex vision-language-action sequences
  • Effectively utilizing diverse training data

ShowUI dataset parsed into FiftyOne format, available on Hugging Face


The ShowUI Dataset

Rather than using a massive dataset, ShowUI employs a carefully curated corpus of 256K data instances with 2.7M element annotations across various platforms:

  • Web data: 22K screenshots with 576K visual elements, deliberately filtered to focus on interactive elements rather than static text
  • Mobile data: 97K screenshots from the AMEX dataset with valuable functionality descriptions
  • Desktop data: A limited set of 100 screenshots augmented with GPT-4o assistance to create diverse queries
  • Navigation data: 137K tasks from GUIAct for web and mobile navigation

A key innovation was the rebalanced sampling strategy that ensured fair exposure to each data type during training, despite significant differences in dataset sizes.
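One simple way to implement that kind of rebalancing is to sample data sources with probabilities flattened relative to their raw sizes (for example, proportional to size^alpha with alpha < 1), so small sources like the 100 desktop screenshots are not drowned out. The weighting scheme below is a generic illustration, not ShowUI’s exact recipe.

```python
# Generic source-rebalanced sampling: draw from each data source with a
# probability proportional to size**alpha, so small sources get boosted.
# alpha and the per-source counts are illustrative, not ShowUI's settings.
import random

source_sizes = {"web": 22_000, "mobile": 97_000, "desktop": 100, "navigation": 137_000}
alpha = 0.5                                   # alpha=1 keeps natural proportions, 0 is uniform

weights = {name: size ** alpha for name, size in source_sizes.items()}
total = sum(weights.values())
probabilities = {name: weight / total for name, weight in weights.items()}

print({name: round(p, 3) for name, p in probabilities.items()})
# Natural sampling would give desktop ~0.0004 of batches; alpha=0.5 lifts it to ~0.012.

def sample_source() -> str:
    """Pick which data source the next training example comes from."""
    return random.choices(list(probabilities), weights=list(probabilities.values()), k=1)[0]
```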

The ShowUI Model

ShowUI is built on the Qwen2-VL-2B foundation and introduces three technical innovations:

  1. UI-Guided Visual Token Selection: Reduces computational costs by treating screenshots as connected graphs, identifying redundant visual areas while preserving important elements. This approach reduces visual tokens by 33% and speeds up training by 1.4× (a rough sketch of the idea follows this list).
  2. Interleaved Vision-Language-Action Streaming: Flexibly handles both multi-step navigation tasks (with visual-action history tracking) and single-screenshot, multi-action tasks through a standardized JSON action format.
  3. Small-scale, High-quality Dataset: Demonstrates that careful curation and balanced sampling can outperform larger, noisier datasets.
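The token-selection idea in point 1 can be approximated in a few lines: treat the screenshot as a grid of patches, connect adjacent patches whose mean colors are nearly identical, and keep one representative token per connected group. The patch size, color tolerance, and keep-one-per-group policy below are simplifications for illustration, not the paper’s exact procedure.

```python
# Rough sketch of UI-guided token selection: merge adjacent, visually
# redundant patches (blank margins, uniform backgrounds) into groups and keep
# one representative token per group.
import numpy as np

def select_patch_tokens(image: np.ndarray, patch: int = 28, tol: float = 1.0):
    """image: (H, W, 3) uint8. Returns a boolean keep-mask over the patch grid."""
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    means = (
        image[: gh * patch, : gw * patch]
        .reshape(gh, patch, gw, patch, 3)
        .mean(axis=(1, 3))                       # (gh, gw, 3) mean color per patch
    )

    parent = list(range(gh * gw))                # union-find over the patch grid

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for r in range(gh):
        for c in range(gw):
            for dr, dc in ((0, 1), (1, 0)):      # right and down neighbours
                nr, nc = r + dr, c + dc
                if nr < gh and nc < gw and np.abs(means[r, c] - means[nr, nc]).max() <= tol:
                    union(r * gw + c, nr * gw + nc)

    keep = np.zeros(gh * gw, dtype=bool)         # one representative per group
    for i in range(gh * gw):
        keep[find(i)] = True
    return keep.reshape(gh, gw)

blank_screen = np.full((1440, 2560, 3), 255, dtype=np.uint8)
keep = select_patch_tokens(blank_screen)
print(int(keep.sum()), "of", keep.size, "patches kept")   # 1 of 4641 for a blank screen
```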

Despite its relatively small size and minimal training data, ShowUI achieves 75.1% accuracy in zero-shot screenshot grounding, setting a new standard for lightweight GUI agents.

Key Lessons for Practitioners

  1. Leverage UI structure: GUI screenshots contain inherent patterns and redundancies that can be exploited to reduce computational costs without losing important information.
  2. Standardize actions: Use structured formats like JSON for actions and provide documentation in prompts to encourage systematic behavior.
  3. Focus on data quality: Carefully analyze and select the most informative data types rather than simply collecting more data. For GUIs, visually rich elements often provide more value than static text.
  4. Augment limited data: When facing scarce data (like the desktop examples), use large language models to generate diverse queries around existing annotations.
  5. Balance your datasets: Implement sampling strategies that ensure all data types receive adequate representation during training, regardless of their original size.
  6. Start lightweight: You don’t need massive models for effective GUI agents. A well-tuned smaller model with thoughtful data curation and efficient visual processing can deliver excellent results.

This research demonstrates that through careful design choices and targeted optimizations, even lightweight models can achieve state-of-the-art performance in complex GUI interaction tasks.

GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration

GUI-Xplore introduces a novel dataset designed to overcome limitations in existing GUI agent systems.

While current solutions often struggle with generalization across different applications, GUI-Xplore addresses this challenge by providing rich exploration context and expanding beyond basic navigation tasks. The research pairs this innovative dataset with Xplore-Agent, a baseline model demonstrating significantly improved performance in unfamiliar app environments through its unique exploration-based approach.

Figure 2 from the paper


The GUI-Xplore Dataset

The dataset comprises exploration videos from 312 apps spanning 6 primary software domains and 33 sub-categories. With 115 hours of exploratory content (averaging 23.73 minutes per app), it delivers comprehensive coverage of real-world GUI interactions.

GUI-Xplore includes over 32,500 question-answer pairs across five carefully designed hierarchical tasks:

  • Application Overview & Page Analysis: Testing understanding of app functions and specific screens
  • Application Usage: Evaluating the ability to infer operation sequences
  • Action Recall & Sequence Verification: Assessing comprehension of temporal and logical relationships

What makes GUI-Xplore unique is its exploration-first approach, which provides contextual app knowledge that enables agents to adapt to new environments, similar to how humans explore unfamiliar interfaces.

The Xplore-Agent Model

Xplore-Agent leverages the dataset through a two-stage pipeline:

  1. Action-aware GUI Modeling: Extracts key frames from exploration videos using luminance difference detection, then converts these frames into structured textual representations of GUI elements and interactions (a minimal sketch of the key-frame step follows this list).
  2. Graph-Guided Environment Reasoning: Constructs a GUI Transition Graph that maps complex page relationships and interaction patterns, then guides an LLM’s reasoning across the five downstream tasks.
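The first stage’s key-frame extraction can be sketched with a simple luminance-difference heuristic: keep a frame whenever its average luminance change relative to the last kept frame crosses a threshold. The threshold and frame format below are illustrative, not the paper’s exact settings.

```python
# Keep a video frame whenever its mean luminance change versus the last kept
# frame exceeds a threshold, i.e. the GUI visibly changed. Illustrative only.
import numpy as np

def luminance(frame: np.ndarray) -> np.ndarray:
    """frame: (H, W, 3) uint8 RGB -> (H, W) float luminance."""
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def extract_key_frames(frames: list[np.ndarray], threshold: float = 10.0) -> list[int]:
    """Return indices of frames where the interface visibly changed."""
    kept = [0]
    reference = luminance(frames[0])
    for index, frame in enumerate(frames[1:], start=1):
        current = luminance(frame)
        if np.abs(current - reference).mean() > threshold:
            kept.append(index)
            reference = current
    return kept

# Example: three identical frames, then one where the screen changes.
frames = [np.zeros((720, 1280, 3), dtype=np.uint8)] * 3
frames.append(np.full((720, 1280, 3), 200, dtype=np.uint8))
print(extract_key_frames(frames))   # [0, 3]
```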

The authors dubbed this the “exploration-then-reasoning paradigm,” an innovation in how GUI agents are trained. Rather than immediately attempting tasks in unfamiliar environments (as traditional approaches do), this paradigm involves:

  1. First, exploring the application interface to gather context about its structure, elements, and interaction patterns.
  2. Then, reasoning about specific tasks using the knowledge gained during exploration

This approach mirrors how humans naturally interact with new software — we typically explore an interface before attempting specific tasks. GUI-Xplore facilitates this through its exploration videos, and Xplore-Agent implements it via its two-stage pipeline (Action-aware GUI Modeling followed by Graph-Guided Environment Reasoning).

This approach showed a 10% performance improvement over state-of-the-art methods when tested on unfamiliar applications, demonstrating the effectiveness of exploration-based learning.

Key Lessons for Practitioners

  • Context is crucial: Pre-exploration of interfaces dramatically improves agent performance in unfamiliar environments.
  • Move beyond simple automation: Complex tasks require understanding local interactions and global app structure.
  • Structure matters: Explicitly modeling UI transitions as graphs helps agents navigate complex application flows.
  • Efficient processing is essential: Converting dense visual information into structured representations makes it manageable for language models.
  • Operational understanding remains challenging: While Xplore-Agent improves, understanding temporal and logical relationships in action sequences still presents significant research opportunities.

Using the exploration-then-reasoning paradigm, developers can create more adaptable GUI agents that better mimic human approaches to navigating unfamiliar interfaces. This could potentially unlock more natural and effective human-computer interaction.

SpiritSight Agent: Advanced GUI Agent with One Look

SpiritSight is designed to help users interact with graphical interfaces by automatically making decisions based on screenshots.

This research tackles a fundamental challenge that has limited previous vision-based agents: poor element grounding (the ability to accurately identify and locate GUI elements like buttons and text boxes). While vision-based approaches offer better cross-platform compatibility than methods requiring HTML or XML data, they have historically struggled to precisely locate interface elements.

Figure 4 from the paper


The GUI-Lasagne Dataset

At the heart of SpiritSight’s capabilities is the GUI-Lasagne dataset, a meticulously structured collection of 5.73 million samples gathered from real-world sources using scalable, cost-effective methods.

The GUI-Lasagne dataset gets its name from its layered structure, designed to systematically build agentic capabilities from foundational skills to complex navigation. This hierarchical approach equips SpiritSight with robust GUI understanding and grounding capabilities through three distinct levels:

Level One: Visual-Text Alignment (3M samples)

  • Purpose: Builds foundational ability to recognize and locate text/icon elements
  • Key tasks: text2bbox (locate elements from text), bbox2text (recognize content within areas), and bbox2dom (understand GUI layout)
  • Collection: Gathered from real-world web and mobile interfaces using automated tools
  • Emphasis: Intentionally contains the most abundant data to develop robust grounding capabilities

Level Two: Visual-Function Alignment (1.5M samples)

  • Purpose: Teaches locating elements based on their function
  • Collection: Synthesized using powerful vision models to generate functional descriptions
  • Validation: Human-verified with 90.9% acceptance rate
  • Output: Creates function2bbox pairs linking element purposes to their locations

Level Three: Visual Navigation (0.64M samples)

  • Purpose: Trains on complete navigation trajectories
  • Innovation: Cleaned using GPT-4o with Chain-of-Thought reasoning to filter out incorrect labels
  • Quality: Human validation confirmed 93.7% reliability in the cleaning process

This hierarchical approach deliberately builds strong grounding abilities before addressing complex navigation tasks, with 90% of data dedicated to grounding (Levels One and Two). The dataset supports both web and mobile platforms and includes English and Chinese samples, enabling cross-platform and cross-lingual capabilities.

Ablation studies confirmed each level’s value, with even mobile navigation data improving web navigation performance through cross-platform knowledge transfer.

The SpiritSight Model

SpiritSight introduces a novel technical approach called Universal Block Parsing (UBP) that solves a fundamental problem in processing high-resolution GUI screenshots:

  • Resolves Positional Ambiguity: UBP replaces traditional global coordinates with block-specific coordinates, creating clear one-to-one mappings between visual inputs and element locations (a minimal sketch of the idea follows this list).
  • Enhances Spatial Understanding: Incorporates 2D Block-wise Position Embedding to preserve spatial relationships between interface elements.
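As a rough illustration of the block-local idea, the sketch below splits a screenshot into fixed-size blocks and expresses a point as (block index, x and y within that block), then converts back. The 448-pixel block size is an assumption for illustration, not the paper’s configuration.

```python
# Minimal sketch of block-local coordinates in the spirit of UBP: split a
# screenshot into a grid of blocks and express each point as (block index,
# x/y within that block). The 448-pixel block size is illustrative.
def to_block_coords(x, y, width, height, block=448):
    cols = (width + block - 1) // block            # number of block columns
    block_index = (y // block) * cols + (x // block)
    return block_index, x % block, y % block

def to_global_coords(block_index, bx, by, width, block=448):
    cols = (width + block - 1) // block
    row, col = divmod(block_index, cols)
    return col * block + bx, row * block + by

# Round-trip the click target from the earlier action example.
idx, bx, by = to_block_coords(483, 217, 2560, 1440)
print(idx, bx, by)                                  # 1, 35, 217
print(to_global_coords(idx, bx, by, 2560))          # (483, 217)
```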

These innovations enable SpiritSight to:

  • Achieve Superior Performance: Outperforms other advanced methods across diverse GUI benchmarks.
  • Work Across Platforms: Functions effectively on web and mobile interfaces without platform-specific adaptation.
  • Scale Effectively: Available in different model sizes (2B, 8B, 26B parameters) to balance performance and resource requirements.
  • Operate End-to-End: Processes screenshots directly to actions without requiring intermediate tools like OCR or candidate element extraction.

By solving the critical element grounding challenge that has limited previous approaches, SpiritSight represents a significant step toward truly versatile, vision-based AI assistants that can help users navigate any graphical interface with unprecedented accuracy and reliability.


Key Lessons for Practitioners

This research demonstrates that with carefully designed datasets and innovative methods like UBP, vision-based GUI agents can achieve the accuracy and reliability needed for practical applications across diverse interface environments. If you’re developing your own GUI agent or dataset, consider these critical insights:

  1. Prioritize element grounding — The ability to locate interface elements accurately is the foundation of effective GUI agents. Dedicate significant training data to this skill.
  2. Structure datasets hierarchically — Build from fundamental skills (recognition, grounding) to complex tasks (navigation) for stronger learning foundations.
  3. Address coordinate ambiguity — When working with high-resolution inputs, consider techniques like UBP to resolve positional ambiguity for better grounding accuracy.
  4. Quality trumps quantity — While scale matters, carefully filtered and cleaned data is more valuable than raw volume. Consider using Chain-of-Thought reasoning to structure and verify navigation data.
  5. Embrace cross-platform training — Including diverse GUI environments (web, mobile) in training data enhances versatility and generalization. Mobile navigation data can even help with web navigation tasks.
  6. Consider end-to-end approaches — With the right dataset and methods, end-to-end vision-based approaches can overcome previous limitations and achieve impressive performance without complex multi-stage pipelines.

ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

The ComfyBench Framework


This research introduces ComfyBench and ComfyAgent as contributions to a new frontier in Visual Agent research: using LLM-based agents to design collaborative AI systems autonomously.

This approach represents a significant paradigm shift from traditional AI research, which has primarily focused on developing monolithic models to maximize performance on specific tasks. Instead, this work explores how agents can design complex systems that integrate multiple specialized models and tools to achieve more sophisticated outcomes.

ComfyBench stands as the first-of-its-kind comprehensive benchmark specifically designed to evaluate agents’ capabilities in designing and executing collaborative AI systems within ComfyUI — an open-source platform where users construct workflows by connecting nodes (representing different models or tools) in directed acyclic graphs (DAGs). Complementing this benchmark is ComfyAgent, a novel framework built upon two core innovations: representing workflows with code rather than other representations, and employing a sophisticated multi-agent system with specialized roles (planning, retrieval, adaptation, refinement) that collaborate to overcome limitations like context constraints and hallucination.

The ComfyBench Benchmark

ComfyBench evaluates agents’ ability to construct ComfyUI workflows through 200 diverse tasks, with documentation for 3,205 nodes and 20 tutorial workflows as resources. Tasks may include auxiliary media requiring specific processing.

The benchmark uses three difficulty levels:

  • 100 “Vanilla” tasks: Basic workflow adaptations
  • 60 “Complex” tasks: Integration of multiple workflow techniques
  • 40 “Creative” tasks: Pushing beyond imitation toward innovation

These challenges test visual programming logic, natural language translation, tool selection, multi-modal reasoning, and parameter tuning. The most advanced tasks require integrating techniques across domains like image generation and video processing.

Unlike benchmarks measuring output quality, ComfyBench assesses agents’ fundamental ability to orchestrate AI components into functional visual systems.

Evaluation Metrics

The research introduces novel evaluation metrics specifically designed for assessing workflow generation. Traditional metrics for image or video generation aren’t applicable here because the focus is on evaluating the generated workflows themselves, not just their outputs.

Two progressive evaluation metrics are employed:

1. Pass Rate

  • Measures the ratio of tasks where generated workflows are syntactically and semantically correct
  • A task is marked as “passed” only if the server successfully executes the workflow and returns a success message
  • This metric verifies the functional correctness of the designed system

2. Resolve Rate

  • Measures the ratio of tasks where workflows produce results matching the task requirements
  • Evaluation uses Visual Language Models (VLMs), specifically GPT-4o, to assess alignment between outputs and instructions
  • The VLM reviews both the task instruction and generated output, providing a True/False judgment on compliance

This two-tier evaluation uniquely assesses an agent’s ability to design collaborative AI systems, both in terms of process (can the workflow execute?) and outcome (does it achieve the desired result?).
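A sketch of how this two-tier scoring could be wired up is below. `submit_workflow` (a ComfyUI execution call), `vlm_judge` (a GPT-4o-based True/False check), and the agent interface are hypothetical placeholders, not the benchmark’s actual code.

```python
# Two-tier scoring: pass rate checks that workflows execute successfully;
# resolve rate additionally asks a VLM judge whether the output matches the
# instruction. `submit_workflow`, `vlm_judge`, and the agent API are
# hypothetical placeholders, not ComfyBench's actual implementation.
def evaluate(tasks, agent):
    passed = resolved = 0
    for task in tasks:
        workflow = agent.design_workflow(task.instruction)  # hypothetical agent API
        result = submit_workflow(workflow)                  # hypothetical ComfyUI call
        if not result.success:                              # syntactic/semantic failure
            continue
        passed += 1                                         # counts toward pass rate
        if vlm_judge(task.instruction, result.output):      # True/False compliance check
            resolved += 1                                   # counts toward resolve rate
    return passed / len(tasks), resolved / len(tasks)
```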

The ComfyAgent Framework

The ComfyAgent multi-agent workflow

ComfyAgent enables LLMs to autonomously design collaborative AI workflows in ComfyUI through two key innovations:

  1. Python-like Code Representation — Workflows are represented in code format rather than JSON or element lists, leveraging LLMs’ code generation abilities and providing richer semantic information. Ablation studies confirm this is the most effective format (a brief sketch follows below).
  2. Multi-Agent Architecture — Addresses single-agent limitations through specialized agents:
    • PlanAgent: Creates and updates the workflow strategy
    • CombineAgent: Integrates multiple workflows
    • AdaptAgent: Adjusts workflow parameters
    • RefineAgent: Checks and fixes errors
    • RetrieveAgent: Gathers relevant knowledge

A memory system stores history, reference materials, and the current workspace. Removing any agent component reduces overall performance, confirming each plays a vital role.
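To make the first innovation concrete, here is what a code-style workflow representation might look like next to a node-list JSON: each line names a node, its inputs, and the variable its output flows into, which gives the LLM explicit dataflow to reason over. The node and function names below are illustrative, not ComfyAgent’s actual vocabulary.

```python
# Illustrative code-style representation of a simple text-to-image workflow.
# Function and node names are made up for illustration; ComfyAgent maps real
# ComfyUI nodes to Python-like statements along these lines.
model, clip, vae = load_checkpoint("sd_xl_base.safetensors")      # hypothetical helpers
positive = encode_text(clip, "a watercolor painting of a lighthouse at dusk")
negative = encode_text(clip, "blurry, low quality")
latent = empty_latent(width=1024, height=1024)
latent = sample(model, positive, negative, latent, steps=30, cfg=7.0)
image = vae_decode(vae, latent)
save_image(image, prefix="lighthouse")

# The same graph as a node list (closer to ComfyUI's native JSON) is flatter
# and hides the dataflow behind numeric link references, which is harder for
# an LLM to follow:
workflow_json = {
    "1": {"class_type": "CheckpointLoader", "inputs": {"ckpt_name": "sd_xl_base.safetensors"}},
    "2": {"class_type": "CLIPTextEncode", "inputs": {"clip": ["1", 1], "text": "a watercolor painting..."}},
    # ... remaining nodes linked by ["node_id", output_index] references
}
```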

Key Lessons for Practitioners

  1. Representation Matters: Code-based workflow representation significantly outperforms other formats like JSON or element lists.
  2. Knowledge Retrieval Is Essential: Effective agents must retrieve and utilize documentation and example workflows rather than relying solely on their inherent knowledge.
  3. Multi-Agent Architecture Works Better: Breaking down complex workflow design into specialized roles (planning, retrieval, adaptation, refinement) helps overcome limitations like context windows and hallucination.
  4. Creative Tasks Remain Challenging: While current agents perform reasonably well on simpler tasks, they struggle with novel applications — ComfyAgent resolved only 15% of creative tasks.
  5. Dataset Construction Advice: When building similar benchmarks, focus on comprehensive documentation, diverse tiered tasks, well-annotated examples, and reliable automated evaluation methods.

This research paves the way for more intelligent collaborative AI systems, though significant challenges remain in developing agents that can streamline rather than just imitate existing workflows.

The Future of Visual Agents is Moving from Perception to Interaction

The CVPR 2025 papers mark a revolutionary leap in visual AI. Visual Agents have evolved from theory to reality, solving fundamental challenges that previously limited their capabilities.

GEA’s unified action tokenizer, ShowUI’s efficient visual processing, GUI-Xplore’s exploration paradigm, and SpiritSight’s Universal Block Parsing collectively bridge the gap between visual understanding and action. Meanwhile, ComfyBench points toward agents that can orchestrate entire AI systems.

This transformation arrives just as digital interfaces permeate every aspect of life. The ability to automate visual interactions promises to streamline workflows, enhance accessibility, and enable entirely new capabilities.

We’re witnessing the early stages of truly agentic Visual AI systems—agents that don’t just perceive the world but meaningfully act within it.

CVPR 2025 will likely be remembered as the tipping point where visual agents made the crucial transition from possibility to practicality.