Understanding Visual Agents - August 7, 2025
Aug 7, 2025
9 AM Pacific
Online. Register for the Zoom!
About this event
Join the Meetup to hear talks from experts on understanding visual agents.
Schedule
Foundational capabilities and models for generalist agents for computers
As we move toward a future where language agents can operate software, browse the web, and automate tasks across digital environments, a pressing challenge emerges: how do we build foundation models that can act as generalist agents for computers? In this talk, we explore the design of such agents—ones that combine vision, language, and action to understand complex interfaces and carry out user intent accurately.


We present OmniACT as a case study, a benchmark that grounds this vision by pairing natural language prompts with UI screenshots and executable scripts for both desktop and web environments. Through OmniACT, we examine the performance of today’s top language and multimodal models, highlight the limitations in current agent behavior, and discuss research directions needed to close the gap toward truly capable, general-purpose digital agents.
BEARCUBS: Evaluating Web Agents' Real-World Information-Seeking Abilities
This talk focuses on the challenges of evaluating AI agents in dynamic web settings, the design and implementation of the BEARCUBS benchmark, and insights gained from comparing human and agent performance. We will discuss the significant performance gap between human users and current state-of-the-art agents, highlighting areas for future improvement in AI web navigation and information retrieval.
Implementing a Practical Vision-Based Android AI Agent
In this talk, I will share practical details of designing and implementing Android AI agents, using deki (RasulOs/deki on GitHub), an ML model that describes the contents of a UI screen. From theory, we will move to practice and the use of these agents in industry and production.
For end users, this means remote use of Android phones or automation of everyday tasks, such as:
  • "Write my friend 'some_name' in WhatsApp that I'll be 15 minutes late"
  • "Open Twitter in the browser and write a post about 'something'"
  • "Read my latest notifications and say if there are any important ones"
  • "Write a LinkedIn post about 'something'"
And for professionals, it enables agentic testing: a new type of testing made possible by the popularization of LLMs and the AI agents that use them as a reasoning core.
Visual Agents: What It Takes to Build an Agent That Can Navigate GUIs Like Humans
We’ll examine conceptual frameworks, potential applications, and future directions of technologies that can “see” and “act” with increasing independence. The discussion will touch on both current limitations and promising horizons in this evolving field.