Munich AI, ML and Computer Vision Meetup - April 22, 2026
Apr 22, 2026
5:30 - 8:30 PM
Impact Hub Munich, Gotzinger Str. 8, 81371 München, Germany
About this event
Join the Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.
Schedule
Learning Disentangled Motion Representations for Open-World Motion Transfer
Recent progress in image- and text-to-video generation has made it possible to synthesize visually compelling videos, yet these models typically lack an explicit, reusable notion of motion. In this talk, I will present recent work on learning high-level, content-independent motion representations directly from open-world video data, with a focus on our NeurIPS spotlight paper introducing DisMo. By disentangling motion semantics from appearance and object identity, such representations enable open-world motion transfer across semantically unrelated entities and provide a flexible interface for adapting and fine-tuning modern video generation models. Beyond generation, I will discuss how abstract motion representations support downstream motion understanding tasks and why they offer a promising direction for more controllable, general, and future-proof video models. The talk will conclude with a broader perspective on the opportunities and challenges of motion-centric representations in computer vision and video learning.
Towards Generating Fully Navigable 3D Scenes
3D world generation is a longstanding goal of computer vision, with applications in VR, gaming, film, robotics, and digital twins. Recent progress in generative models, in particular image and video diffusion models, enables the automatic generation of photorealistic 3D environments. This talk describes a simple yet effective framework for exploiting these models for 3D scene generation: we'll briefly cover early approaches (Text2Room, ViewDiff) and then dive deep into our recent state-of-the-art approach, WorldExplorer.
Finding Motion in Commotion: Estimating and Anticipating Motion in Everyday Visual Scenes
Motion is an intrinsic property of video data. How do we harness motion from the abundance of videos to advance vision foundation models? This talk will examine key challenges and emerging opportunities in motion estimation and motion-aware representation learning at scale. Drawing on our latest results from NeurIPS and ICCV, the talk will show how motion-centric learning can enable more versatile and generalisable vision foundation models.
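For readers new to the area, here is a minimal sketch of the core task this talk builds on, dense motion estimation between two frames, using torchvision's off-the-shelf RAFT model. This is a generic illustration of the task, not the speaker's method, and it uses random tensors as stand-ins for real video frames.

```python
# Minimal optical-flow sketch with torchvision's RAFT (a generic example,
# not the method presented in the talk).
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights

weights = Raft_Small_Weights.DEFAULT
model = raft_small(weights=weights).eval()

# Two consecutive frames, batched as (N, 3, H, W); H and W must be divisible by 8.
frame1 = torch.rand(1, 3, 240, 320)
frame2 = torch.rand(1, 3, 240, 320)
frame1, frame2 = weights.transforms()(frame1, frame2)  # normalize for RAFT

with torch.no_grad():
    flows = model(frame1, frame2)  # list of iteratively refined flow estimates

flow = flows[-1]  # final estimate: (N, 2, H, W) per-pixel displacement field
print(flow.shape)
```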
Small Models, Big Intelligence: How vLLM Semantic Router Uses Sub-2B Language Models for Production-Scale Routing
The vLLM Semantic Router introduces a groundbreaking approach to intelligent LLM request routing through its MoM (Mixture of Models) family, a collection of specialized small language models that make split-second routing decisions for production systems. This system operates between users and models, capturing signals from requests, responses, and context to make intelligent routing decisions, including model selection, safety filtering (jailbreak, PII), semantic caching, and hallucination detection. In this talk, we'll explore how the router leverages tiny but powerful models like ModernBERT (encoder-based) and Qwen3 (0.6B-1.7B parameter decoder models) to achieve sub-10ms latency classification at over 10,000 queries per second. We'll dive into the technical architecture showing how these small models handle domain classification, jailbreak detection, PII protection, and hallucination detection, proving that for routing intelligence, size isn't everything.
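As a rough illustration of the routing pattern the talk describes (classify a request with a small encoder model, then dispatch it to a backend), here is a minimal sketch. The checkpoint name, label set, and routing table are hypothetical placeholders, not the vLLM Semantic Router's actual API.

```python
# Illustrative sketch of classifier-based request routing; the checkpoint
# "your-org/modernbert-domain-router" and the label set are hypothetical.
from transformers import pipeline

# A small encoder fine-tuned for domain classification (hypothetical checkpoint).
classifier = pipeline("text-classification", model="your-org/modernbert-domain-router")

# Assumed routing table: predicted domain -> backend model.
ROUTES = {
    "code": "qwen2.5-coder-32b",
    "math": "deepseek-r1",
    "general": "llama-3.1-8b-instruct",
}

def route(prompt: str) -> str:
    """Classify the prompt and pick a backend model to serve it."""
    pred = classifier(prompt)[0]  # e.g. {"label": "code", "score": 0.97}
    return ROUTES.get(pred["label"], ROUTES["general"])

print(route("Write a Rust function that parses JSON."))
```

In production, the same classifier pass can also emit safety signals (jailbreak, PII) alongside the domain label, which is what keeps the routing overhead within a single sub-10ms forward pass.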
Data Foundations for Vision-Language-Action Models
Model architectures get the papers, but data decides whether robots actually work. This talk introduces VLAs from a data-centric perspective: what makes robot datasets fundamentally different from image classification or video understanding, how the field is organizing its data (Open X-Embodiment, LeRobot, RLDS), and what evaluation benchmarks actually measure. We'll examine unique challenges such as temporal structure, proprioceptive signals, and heterogeneity in embodiment, and discuss why addressing them matters more than the next architectural innovation.
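To make the data-centric point concrete, here is a minimal sketch of inspecting a single time step of a robot dataset with the LeRobot library. It assumes the lerobot package's LeRobotDataset API (the import path may vary across versions), and the exact feature keys differ per dataset.

```python
# Illustrative sketch: one robot-learning sample vs. an (image, label) pair.
# Assumes the lerobot package; key names vary by dataset.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("lerobot/pusht")  # a small public dataset on the Hugging Face Hub

sample = ds[0]
for key, value in sample.items():
    # Unlike a classification sample, each step bundles camera frames,
    # proprioceptive state, actions, and timing into one time-indexed record.
    shape = getattr(value, "shape", None)
    print(key, shape if shape is not None else value)
```

Printing the keys makes the abstract's point visible: temporal indices, proprioception, and actions are first-class parts of every sample, which is exactly the structure that image and video datasets lack.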