Join our virtual meetup to hear talks from experts on cutting-edge Visual AI topics for video AI use cases.
Schedule
Video Reasoning for Worker Safety
Ensuring worker safety in industrial environments requires more than object detection or motion tracking; it demands a genuine understanding of human actions, context, and risk. This talk demonstrates how NVIDIA Cosmos Reason, a multimodal video-reasoning model, interprets workplace scenarios with sophisticated temporal and semantic awareness, identifying nuanced safe and unsafe behaviors that conventional vision systems frequently overlook. By integrating Cosmos Reason with FiftyOne, users achieve both automated safety assessments and transparent, interpretable explanations revealing why specific actions are deemed hazardous. Using a curated worker-safety dataset of authentic factory-floor footage, we show how video reasoning enhances audits, training, and compliance workflows while minimizing dependence on extensive labeled datasets. The resulting system demonstrates the potential of explainable multimodal AI to enable safer, more informed decision-making across manufacturing, logistics, construction, healthcare, and other sectors where understanding human behavior is essential.
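As a rough illustration of what such an integration could look like, the sketch below loads a few clips into a FiftyOne video dataset and attaches a per-clip safety verdict and explanation as sample fields. The `cosmos_reason()` helper, the file paths, and its returned fields are placeholders invented for illustration, not an official Cosmos Reason API; only the FiftyOne calls (`fo.Dataset`, `fo.Sample`, `fo.Classification`, `fo.launch_app`) are the library's standard APIs.

```python
import fiftyone as fo

# Load factory-floor clips into a FiftyOne video dataset.
# (Paths below are illustrative placeholders.)
dataset = fo.Dataset("worker-safety-demo")
dataset.add_samples([
    fo.Sample(filepath=p)
    for p in ["clips/forklift_01.mp4", "clips/ladder_02.mp4"]
])

def cosmos_reason(video_path: str) -> dict:
    """Placeholder for a call to a video-reasoning model such as Cosmos Reason.

    In practice this would send the clip (or sampled frames) to the model and
    parse its structured output; here it returns a fixed dummy result.
    """
    return {
        "verdict": "unsafe",
        "confidence": 0.87,
        "explanation": "Worker walks under a suspended load without a hard hat.",
    }

for sample in dataset:
    result = cosmos_reason(sample.filepath)
    # Store the verdict as a sample-level classification and keep the model's
    # explanation alongside it so reviewers can see *why* a clip was flagged.
    sample["safety"] = fo.Classification(
        label=result["verdict"],
        confidence=result["confidence"],
    )
    sample["explanation"] = result["explanation"]
    sample.save()

# Review verdicts and explanations interactively in the FiftyOne App.
session = fo.launch_app(dataset)
```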
VIDEOP2R: Video Understanding from Perception to Reasoning
Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO, and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
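For intuition, here is a minimal sketch of the group-relative advantage computation used by GRPO-style methods, extended under the assumption that a process-aware variant assigns separate rewards to perception and reasoning and normalizes each within its own group. Function names and the toy rewards are illustrative assumptions, not the paper's implementation of PA-GRPO.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantage: normalize rewards within a group of rollouts
    sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def pa_grpo_advantages(perception_rewards, reasoning_rewards, eps=1e-6):
    """Illustrative process-aware variant: perception and reasoning receive
    separate rewards, each normalized within its own group. The resulting
    advantages would then be applied to the perception and reasoning token
    spans of each rollout, respectively (an assumption for illustration)."""
    a_perc = group_relative_advantages(np.asarray(perception_rewards, float), eps)
    a_reas = group_relative_advantages(np.asarray(reasoning_rewards, float), eps)
    return a_perc, a_reas

# Example: a group of 4 rollouts for one video-question pair
a_p, a_r = pa_grpo_advantages([1.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 0.0])
print(a_p, a_r)
```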
Layer-Aware Video Composition via Split-then-Merge
Split-then-Merge (StM) is a novel generative framework that overcomes data scarcity in video composition by splitting unlabeled videos into separate foreground and background layers for self-supervised learning. By using a transformation-aware training pipeline with multi-layer fusion, the model learns to realistically compose dynamic subjects into diverse scenes without relying on expensive annotated datasets. This presentation will cover the problem of video composition and the details of StM, an approach that tackles the problem from a generative AI perspective. We will conclude by demonstrating how StM works and how it outperforms state-of-the-art methods in both quantitative benchmarks and qualitative evaluations.
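To make the idea concrete, here is a toy sketch of a split-then-merge style self-supervised step, assuming only what the abstract states: an unlabeled clip is split into foreground and background layers via a mask, the foreground is perturbed by a transformation, and a composition model merges the layers back, with the original video as the reconstruction target. The function names, the L1 loss, and the identity stand-ins are assumptions for illustration, not StM's actual pipeline.

```python
import numpy as np

def alpha_composite(fg, bg, alpha):
    """Merge a foreground layer onto a background layer with a soft mask.
    Videos are (T, H, W, C) arrays in [0, 1]; alpha is (T, H, W, 1)."""
    return alpha * fg + (1.0 - alpha) * bg

def split_then_merge_step(video, alpha, transform, compose_model):
    """Illustrative self-supervised step: split, perturb, merge, reconstruct."""
    fg = video * alpha                       # split: foreground layer
    bg = video * (1.0 - alpha)               # split: background layer
    fg_aug = transform(fg)                   # transformation-aware perturbation
    pred = compose_model(fg_aug, bg, alpha)  # merge: multi-layer fusion
    loss = np.abs(pred - video).mean()       # self-supervised reconstruction loss
    return loss

# Toy usage with identity stand-ins for the transform and the composition model
T, H, W, C = 4, 8, 8, 3
video = np.random.rand(T, H, W, C)
alpha = (np.random.rand(T, H, W, 1) > 0.5).astype(float)
loss = split_then_merge_step(video, alpha,
                             transform=lambda x: x,
                             compose_model=alpha_composite)
print(loss)
```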
Video Intelligence Is Going Agentic
Video content has become ubiquitous in our digital world, yet the tools for working with video have remained largely unchanged for decades. This talk explores how the convergence of foundation models and agent architectures is fundamentally transforming video interaction and creation. We'll examine how video-native foundation models, multimodal interfaces, and agent transparency are reshaping enterprise media workflows through a deep dive into Jockey, a pioneering video agent system.