AI, Machine Learning and Computer Vision Meetup

April 24, 2025 | 10 AM Pacific

Register for the Zoom

Towards a Multimodal AI Agent that Can See, Talk and Act

Jianwei Yang

Microsoft Research

The development of multimodal AI agents marks a pivotal step toward creating systems capable of understanding, reasoning, and interacting with the world in human-like ways. Building such agents requires models that not only comprehend multi-sensory observations but also act adaptively to achieve goals within their environments. In this talk, I will present my research journey toward this grand goal across three key dimensions.

First, I will explore how to bridge the gap between core vision understanding and multimodal learning through unified frameworks at various granularities. Next, I will discuss connecting vision-language models with large language models (LLMs) to create intelligent conversational systems. Finally, I will delve into recent advancements that extend multimodal LLMs into vision-language-action models, forming the foundation for general-purpose robotics policies. To conclude, I will highlight ongoing efforts to develop agentic systems that integrate perception with action, enabling them to not only understand observations but also take meaningful actions in a single system.

Together, these efforts point toward the next generation of multimodal AI agents: systems capable of seeing, talking, and acting across diverse scenarios in both digital and physical worlds.

ConceptAttention: Interpreting the Representations of Diffusion Transformers

Alec Helbling

Georgia Tech

Recently, diffusion transformers have taken over as the state-of-the-art model class for both image and video generation. However, similar to many existing deep learning architectures, their high-dimensional hidden representations are difficult to understand and interpret. This lack of interpretability is a barrier to their controllability and safe deployment.

We introduce ConceptAttention, an approach to interpreting the representations of diffusion transformers. Our method allows users to create rich saliency maps depicting the location and intensity of textual concepts. Our approach exposes how a diffusion model “sees” a generated image and notably requires no additional training. ConceptAttention improves upon widely used approaches like cross-attention maps for isolating the location of visual concepts, and it even generalizes to real-world (not just generated) images and to video generation models!
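To make the idea of a concept saliency map concrete, here is a toy NumPy sketch, not the actual ConceptAttention method: it scores each image patch token against each textual concept embedding via scaled dot products and normalizes over concepts, so each patch's score distributes across the concepts. The function name, shapes, and scoring rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def concept_saliency(patch_tokens, concept_tokens):
    """Toy concept saliency (illustrative, not ConceptAttention itself).

    patch_tokens:   (num_patches, d) hidden states for image patches
    concept_tokens: (num_concepts, d) embeddings for textual concepts

    Returns (num_concepts, num_patches) scores: scaled dot products,
    softmax-normalized over concepts so each patch's scores sum to 1.
    """
    d = patch_tokens.shape[1]
    logits = concept_tokens @ patch_tokens.T / np.sqrt(d)
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    return probs

# Example: a 16x16 grid of patch tokens (256 patches) and 3 concepts
rng = np.random.default_rng(0)
patches = rng.standard_normal((256, 64))
concepts = rng.standard_normal((3, 64))
maps = concept_saliency(patches, concepts)
heatmap = maps[0].reshape(16, 16)  # spatial saliency map for concept 0
```

Reshaping one row of the output back to the patch grid gives the kind of per-concept heatmap the talk describes; the real method derives these scores inside the diffusion transformer's attention layers rather than from raw dot products.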

Our work serves to improve the community’s understanding of how diffusion models represent data and has numerous potential applications, like image editing.

RGB-X Model Development: Exploring Four Channel ML Workflows

Daniel Gural

Voxel51

Machine Learning is rapidly becoming multimodal. While many Computer Vision models are expanding into modalities like language and 3D, one area that has also been quietly advancing is RGB-X data: standard RGB imagery paired with an extra channel such as infrared, depth, or surface normals. In this talk we will cover some of the leading models in this fast-growing field of Visual AI and share best practices for working with these complex data formats!
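As a minimal illustration of the four-channel workflow the talk covers, the sketch below stacks a single extra channel (e.g. a depth map) onto an RGB image to form an RGB-X array. The helper name and shapes are assumptions for this example, not an API from the talk or from any particular library.

```python
import numpy as np

def stack_rgbx(rgb, x):
    """Stack one extra channel onto an RGB image to form RGB-X.

    rgb: (H, W, 3) image array
    x:   (H, W) extra channel (depth, infrared, ...), assumed already
         normalized to the same dtype/range as rgb
    """
    if rgb.shape[:2] != x.shape:
        raise ValueError("RGB and X channels must share spatial dimensions")
    # Add a trailing channel axis to x, then concatenate along channels
    return np.concatenate([rgb, x[..., None]], axis=-1)

rgb = np.zeros((480, 640, 3), dtype=np.float32)
depth = np.ones((480, 640), dtype=np.float32)
rgbd = stack_rgbx(rgb, depth)  # shape (480, 640, 4)
```

In practice the tricky parts are exactly what this sketch glosses over: aligning the extra channel to the RGB frame, normalizing its units, and choosing how a pretrained three-channel backbone should consume the fourth channel.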

Find a Meetup Near You

Join the AI and ML enthusiasts who have already become members

The goal of the AI, Machine Learning, and Computer Vision Meetup network is to bring together a community of data scientists, machine learning engineers, and open source enthusiasts who want to share and expand their knowledge of AI and complementary technologies. If that’s you, we invite you to join the Meetup closest to your timezone.