Visual Agents: What it Takes to Build an Agent that can Navigate GUIs like Humans - April 9, 2026
Apr 9, 2026
9 AM Pacific
Online. Register for the Zoom!
About this event
Join our virtual meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.
Visual agents that can understand and interact with graphical user interfaces represent a transformative frontier in AI automation. These systems combine computer vision, natural language understanding, and spatial reasoning to navigate complex interfaces, from web applications to desktop software, much as humans do. However, building robust GUI agents requires careful attention to dataset curation, model evaluation, and iterative improvement workflows.
This hands-on workshop provides a comprehensive introduction to building and evaluating visual agents for GUI automation using modern tools and techniques. Participants will learn how to leverage FiftyOne, an open-source toolkit for dataset curation and computer vision workflows, to build production-ready GUI agent systems.
What You'll Learn:
  • Dataset Creation & Management: How to structure, annotate, and load GUI interaction datasets using the COCO4GUI standardized format
  • Data Exploration & Analysis: Using FiftyOne's interactive interface to visualize datasets, analyze action distributions, and understand annotation patterns
  • Multimodal Embeddings: Computing embeddings for screenshots and UI element patches to enable similarity search and retrieval
  • Model Inference: Running state-of-the-art models like Microsoft's GUI-Actor to predict interaction points from natural language instructions
  • Performance Evaluation: Measuring model accuracy using standard metrics and normalized click distance to assess localization precision
  • Failure Analysis: Investigating model failures through attention maps, error pattern analysis, and systematic debugging workflows
  • Data-Driven Improvement: Tagging samples based on error types (attention misalignment vs. localization errors) to prioritize fine-tuning efforts
  • Synthetic Data Generation: Using FiftyOne plugins to augment training data with synthetic task descriptions and variations
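
To make the evaluation topic above more concrete, here is a minimal sketch of how a normalized click distance might be computed. The announcement does not define the metric, so the definition used here is an assumption: the Euclidean distance between the predicted click and the center of the target element, with each axis divided by the screenshot's width or height. The function names are hypothetical and are not part of FiftyOne or GUI-Actor. Boxes are assumed to follow the COCO convention of `[x, y, width, height]` in pixels.

```python
import math


def box_center(box_xywh):
    """Return the center point of a COCO-style [x, y, width, height] box."""
    x, y, w, h = box_xywh
    return (x + w / 2, y + h / 2)


def normalized_click_distance(pred_xy, target_xy, width, height):
    """Distance between two pixel points after scaling each axis to [0, 1].

    Assumed definition: the workshop may use a different normalization.
    """
    dx = (pred_xy[0] - target_xy[0]) / width
    dy = (pred_xy[1] - target_xy[1]) / height
    return math.hypot(dx, dy)


def click_hits_box(pred_xy, box_xywh):
    """True if the predicted click lands inside the target box (edges included)."""
    x, y, w, h = box_xywh
    return x <= pred_xy[0] <= x + w and y <= pred_xy[1] <= y + h


# Hypothetical example: a 1920x1080 screenshot with one target button.
target_box = (100, 200, 80, 40)
predicted_click = (150, 230)

distance = normalized_click_distance(
    predicted_click, box_center(target_box), width=1920, height=1080
)
hit = click_hits_box(predicted_click, target_box)
print(f"hit={hit}, normalized distance={distance:.4f}")
```

Normalizing by image size keeps scores comparable across screenshots of different resolutions. Pairing the distance with a simple hit test is one way to separate near misses from clicks that land far from the target.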