Visual Document AI: Because a Pixel is Worth a Thousand Tokens - November 6, 2025

This event has ended, but you can still catch up! Watch the on-demand recordings and register for our future events.

Nov 06, 2025

9 - 11 AM Pacific

Online. Register for the Zoom!

About this event

Join us for a virtual event to hear talks from experts on the latest developments in Visual Document AI.

Schedule

Document AI: A Review of the Latest Models, Tasks and Tools

In this talk, we'll go through everything document AI: trends, models, tasks, and tools. By the end, you will be ready to start building apps based on document models.

Resources:

Smol Vision

About the Speaker:

Merve Noyan works on multimodal AI and computer vision at Hugging Face, and she's the author of the O'Reilly book Vision Language Models.

Run Document VLMs in Voxel51 with the VLM Run Plugin — PDF to JSON in Seconds

The new VLM Run Plugin for Voxel51 runs document vision-language models directly within the Voxel51 environment. The integration turns complex document workflows, from PDFs and scanned forms to reports, into structured JSON outputs in seconds. By treating documents as images, our approach remains general, scalable, and compatible with any visual model architecture. The plugin connects visual data curation with model inference, so teams can run, visualize, and evaluate document understanding models with little effort. Document AI is now faster, more reproducible, and natively integrated into your Voxel51 workflows.

Resources:

VLM Run Plugin Docs

About the Speaker:

Dinesh Reddy is a founding team member of VLM Run, where he is helping nurture the platform from a sapling into a robust ecosystem for running and evaluating vision-language models across modalities. Previously, he was a scientist at Amazon AWS AI, working on large-scale machine learning systems for intelligent document understanding and visual AI. He completed his Ph.D. at the Robotics Institute, Carnegie Mellon University, focusing on combining learning-based methods with 3D computer vision for in-the-wild data. His research has been recognized with the Best Paper Award at IEEE IVS 2021 and fellowships from Amazon Go and Qualcomm.

CommonForms: Automatically Making PDFs Fillable

Converting static PDFs into fillable forms remains a surprisingly difficult task, even with the best commercial tools available today. We show that with careful dataset curation and model tuning, it is possible to train high-quality form field detectors for under $500. As part of this effort, we introduce CommonForms, a large-scale dataset of nearly half a million curated form images. We also release a family of highly accurate form field detectors, FFDNet-S and FFDNet-L.

Resources:

Try CommonForms

About the Speaker:

Joe Barrow is a researcher at Pattern Data, specializing in document AI and information extraction. He previously worked at the Adobe Document Intelligence Lab after receiving his PhD from the University of Maryland in 2022.

Visual Document Retrieval: How to Cluster, Search and Uncover Biases in Document Image Datasets Using Embeddings

In this talk you'll learn about the task of visual document retrieval and the models widely used by the community, then see them in action in the FiftyOne App. You'll learn how to use these models to identify groups and clusters of documents, find unique documents, uncover biases in your visual document dataset, and search your document corpus using natural language.

Resources:

GitHub Repo

About the Speaker:

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in VLMs, Visual Agents, Document AI, and Physical AI.