Boston AI, ML and Computer Vision Meetup - February 18, 2026
Feb 18, 2026
5:30 - 8:30 PM EST
Microsoft Research Lab – New England (NERD) at MIT, Deborah Sampson Conference Room, One Memorial Drive, Cambridge, MA 02142
About this event
Join the Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.
Schedule
SNAP: Towards Segmenting Anything in Any Point Cloud
Segmenting objects in 3D point clouds is a core problem in 3D scene understanding and scalable data annotation. In this talk, I will present SNAP: Segmenting Anything in Any Point Cloud, a unified framework for interactive point cloud segmentation that supports both point-based and text-based prompts across indoor, outdoor, and aerial domains. SNAP is trained jointly on multiple heterogeneous datasets and achieves strong cross-domain generalization through domain-adaptive normalization. The model enables both spatially prompted instance segmentation and text-prompted panoptic and open-vocabulary segmentation directly on point clouds. Extensive experiments demonstrate that SNAP matches or outperforms domain-specific methods on a wide range of zero-shot benchmarks.
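The abstract credits SNAP's cross-domain generalization to domain-adaptive normalization. As a rough illustration of that idea (not SNAP's actual implementation; the class and parameter names below are hypothetical), one common form keeps a shared normalization but learns a separate scale and shift per domain:

```python
import torch
import torch.nn as nn

class DomainAdaptiveNorm(nn.Module):
    """Sketch: layer norm with per-domain learnable scale/shift.

    Assumes "domain-adaptive normalization" means domain-specific affine
    parameters on top of a shared, affine-free normalization. This is an
    illustration of the general technique, not SNAP's published design.
    """

    def __init__(self, dim, domains=("indoor", "outdoor", "aerial")):
        super().__init__()
        # Normalize statistics the same way for every domain...
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # ...but let each domain learn its own affine transform.
        self.scale = nn.ParameterDict(
            {d: nn.Parameter(torch.ones(dim)) for d in domains})
        self.shift = nn.ParameterDict(
            {d: nn.Parameter(torch.zeros(dim)) for d in domains})

    def forward(self, x, domain):
        return self.norm(x) * self.scale[domain] + self.shift[domain]

# Per-point features for a batch of outdoor scans (shapes are illustrative).
feats = torch.randn(2, 1024, 256)
dan = DomainAdaptiveNorm(256)
out = dan(feats, "outdoor")
```

The shared trunk then sees features on a comparable scale regardless of whether the input came from an indoor, outdoor, or aerial sensor.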
Culturally Adaptive AI
AI can now generate videos, images, speech, and text that are almost indistinguishable from human-created content. As generative AI systems become more sophisticated, we increasingly question whether what appears in our feeds is real. There is a need, now more than ever, for models that help humans distinguish between real and AI-generated content. How can we shape the next generation of AI models to be more explainable, safe, and creative? How can we make these models teach humans about different cultures, strengthening human-AI collaboration? This talk highlights emerging techniques, and the future of AI, that will improve trust in generative AI systems by integrating insights from multimodality, reasoning, and factuality. Tomorrow's AI won't just process data and generate content; we imagine it will amplify our creativity, extend our compassion, and help us rediscover what makes us fundamentally human.
Data Foundations for Vision-Language-Action Models
Model architectures get the papers, but data decides whether robots actually work. This talk introduces vision-language-action (VLA) models from a data-centric perspective: what makes robot datasets fundamentally different from image classification or video understanding, how the field is organizing its data (Open X-Embodiment, LeRobot, RLDS), and what evaluation benchmarks actually measure. We'll examine the unique challenges of robot data, such as temporal structure, proprioceptive signals, and heterogeneity of embodiment, and discuss why addressing them matters more than the next architectural innovation.
Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities (such as text, audio, or images) consistently improves downstream performance across diverse unimodal targets such as images and audio. Our project page: https://unpaired-multimodal.github.io/
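The core training loop the abstract describes, alternating unpaired batches from different modalities through shared parameters, can be sketched roughly as follows. This is a minimal illustration of the paradigm, not the UML implementation; the module names, dimensions, and two-modality setup are all hypothetical:

```python
import torch
import torch.nn as nn

class SharedTrunkModel(nn.Module):
    """Sketch: modality-specific input projections into one shared trunk.

    Illustrates the alternating-modality idea from the abstract; the
    architecture and names here are hypothetical, not UML's actual design.
    """

    def __init__(self, input_dims, hidden=128, n_classes=10):
        super().__init__()
        # One lightweight projection per modality maps each input space
        # into a common hidden space.
        self.proj = nn.ModuleDict(
            {m: nn.Linear(d, hidden) for m, d in input_dims.items()})
        # The trunk's parameters are shared across all modalities.
        self.trunk = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes))

    def forward(self, x, modality):
        return self.trunk(self.proj[modality](x))

# Unpaired setup: image and audio batches never need to correspond.
dims = {"image": 512, "audio": 64}
model = SharedTrunkModel(dims)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(4):
    # Alternate which modality supplies the batch at each step.
    modality = "image" if step % 2 == 0 else "audio"
    x = torch.randn(8, dims[modality])          # stand-in unpaired batch
    y = torch.randint(0, 10, (8,))              # stand-in labels
    loss = loss_fn(model(x, modality), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the trunk is updated by gradients from both modalities, structure learned from the auxiliary modality can shape the representation used for the target one, without any paired examples.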