Part 1: Navigating the GUI Agent Landscape
Understanding the Foundation Before Building
The GUI agent field is evolving rapidly, but success requires an understanding of what came before. In this opening session, we'll map the terrain of GUI agent research—from the early days of MiniWoB's simplified environments to today's complex, multimodal systems tackling real-world applications. You'll discover why standard vision models fail catastrophically on GUI tasks, explore the annotation bottlenecks that make GUI datasets so expensive to create, and understand the platform fragmentation that makes "click a button" mean twenty different things across datasets.
We'll dissect the most influential datasets (Mind2Web, AITW, Rico) and models that have shaped the field, examining their strengths, limitations, and the research gaps they reveal. By the end, you'll have a clear picture of where GUI agents excel, where they struggle, and, most importantly, where the opportunities lie for your own contributions.
Part 2: From Pixels to Predictions - Building Your GUI Dataset
Hands-On Dataset Creation and Curation with FiftyOne
The best GUI models are only as good as their training data, and the best datasets are built by understanding what makes GUI interactions fundamentally different from natural images. In this practical session, you'll build a complete GUI dataset from scratch, learning to capture the precise annotations that GUI agents need.
Using FiftyOne as your data management backbone, you'll import diverse GUI screenshots, explore annotation strategies that go beyond bounding boxes, and implement efficient labeling workflows. We'll tackle the real challenges: handling platform differences, managing annotation quality, and creating datasets that transfer to new domains. You'll also learn advanced techniques like synthetic data generation and automated prelabeling to scale your annotation efforts.
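As a taste of what the session covers, here is a minimal sketch of loading a GUI screenshot into FiftyOne with element-level annotations. The file path, label names, and the extra "action", "text", and "platform" fields are illustrative assumptions, not the workshop's exact schema; the FiftyOne calls (Dataset, Sample, Detections, launch_app) follow the library's standard pattern of relative [x, y, width, height] bounding boxes with custom attributes.

```python
import fiftyone as fo

# Create a dataset for GUI screenshots (name and paths are placeholders).
dataset = fo.Dataset("gui-screenshots", overwrite=True)

sample = fo.Sample(filepath="/data/screens/login_page.png")

# GUI elements stored as relative-coordinate boxes [x, y, width, height],
# with custom attributes for the element text and the action it affords.
sample["elements"] = fo.Detections(
    detections=[
        fo.Detection(
            label="button",
            bounding_box=[0.42, 0.63, 0.16, 0.05],
            text="Sign in",
            action="click",
        )
    ]
)
sample["platform"] = "web"  # track platform to study cross-domain transfer

dataset.add_samples([sample])

# Browse, filter, and audit annotation quality in the FiftyOne App.
session = fo.launch_app(dataset)
```

Storing the action type and platform alongside each box makes it easy to slice the dataset later, for example to check how many click targets you have per platform before training.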
Walk away with a production-ready dataset and the methodology to build the next generation of GUI training data, because in GUI agents, data quality determines everything.
Part 3: Teaching Machines to See and Click - Model Finetuning
From Foundation Models to GUI Specialists
Foundation models, such as Qwen2.5-VL, demonstrate impressive visual understanding, but they require specialized training to master GUI interactions. In this final session, you'll transform a general-purpose vision-language model into a GUI specialist that can navigate interfaces with human-like precision.
We'll explore modern fine-tuning strategies specifically designed for GUI tasks, from selecting the right architecture to handling the unique challenges of coordinate prediction and multi-step reasoning. You'll implement training pipelines that can handle the diverse formats and platforms in your dataset, evaluate models on metrics that actually matter for GUI automation, and deploy your trained model in a real-world testing environment.
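To make the coordinate-prediction challenge concrete, here is a small sketch of turning one GUI annotation into a chat-style supervised fine-tuning record. The 0-1000 coordinate normalization and the "click(x, y)" action string are one common convention rather than the workshop's fixed format, and the message schema mirrors the multimodal chat layout many VLM processors accept; the exact fields depend on your training framework.

```python
def to_training_record(screenshot_path, instruction, box, image_size):
    """Convert one GUI annotation into a chat-style SFT record.

    `box` is an absolute-pixel [x, y, width, height] element box; the
    target is the element's center, normalized to a 0-1000 grid so the
    model predicts resolution-independent coordinates.
    """
    img_w, img_h = image_size
    center_x = box[0] + box[2] / 2
    center_y = box[1] + box[3] / 2
    x_norm = round(1000 * center_x / img_w)
    y_norm = round(1000 * center_y / img_h)

    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": screenshot_path},
                    {"type": "text", "text": instruction},
                ],
            },
            # The assistant turn is the supervision target: a single action
            # string the model learns to emit.
            {"role": "assistant", "content": f"click({x_norm}, {y_norm})"},
        ]
    }

# Example: a "Sign in" button at pixel (830, 540) in a 1920x1080 screenshot.
record = to_training_record(
    "/data/screens/login_page.png",
    "Click the Sign in button",
    box=[830, 540, 260, 60],
    image_size=(1920, 1080),
)
```

Normalizing coordinates this way lets one model serve screenshots of different resolutions, and keeping the target as a short action string makes it straightforward to score predictions with click-in-box accuracy during evaluation.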