Document Visual AI Workshop - September 2, 2026
Sep 02, 2026
9:00 AM - 11:00 AM Pacific Time
Online. Register for the Zoom!
About this event
In this hands-on workshop, you'll use FiftyOne and the High Quality Invoice Images for OCR dataset to run the full data-centric loop end to end: embed invoices with a modern visual document model, cluster them by structure, run LightOnOCR as your base model, and overlay per-sample evaluation scores on the embedding space to find *where* and *why* it fails. You'll then turn those insights into a curated fine-tuning view that combines three filters (low ANLS, i.e. Average Normalized Levenshtein Similarity, high representativeness, and uniqueness), fine-tune LightOnOCR on that view, and come back to FiftyOne to verify that the failure clusters actually got fixed.
The punchline: a few hundred invoices chosen by combining embedding signals with evaluation metrics beats thousands of randomly sampled ones, every time.
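
For readers unfamiliar with the per-sample score mentioned above, here is a minimal sketch of ANLS for a single prediction/reference pair. It assumes the commonly used 0.5 distance threshold and simple lowercase/whitespace normalization; the function names are illustrative and are not taken from the workshop materials.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls_score(prediction: str, reference: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity for one sample.

    Assumption: text is lowercased and whitespace-collapsed before comparison,
    and any normalized distance at or above `threshold` scores 0.
    """
    pred = " ".join(prediction.lower().split())
    ref = " ".join(reference.lower().split())
    if not pred and not ref:
        return 1.0
    distance = levenshtein(pred, ref) / max(len(pred), len(ref))
    return 1.0 - distance if distance < threshold else 0.0


if __name__ == "__main__":
    # One wrong character out of nine: high but imperfect score.
    print(round(anls_score("Total 100", "Total 108"), 3))
```

Averaging this score over a dataset gives the aggregate ANLS; keeping it per sample is what makes it possible to color an embedding plot by failure severity.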

What You'll Walk Away With

  • A working FiftyOne pipeline for any document collection you own
  • A repeatable curation query that combines evaluation + embedding signals
  • A fine-tuned LightOnOCR checkpoint that demonstrably outperforms the base model on your invoices
  • The mental model that data curation — not architecture or hyperparameters — is the highest-leverage thing you can do to improve a document AI system
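
As a rough illustration of what a curation query combining evaluation and embedding signals might look like, the sketch below filters plain Python records rather than a FiftyOne dataset. The field names, thresholds, and budget are hypothetical placeholders, not values from the workshop; in FiftyOne the same logic would typically be expressed as a filtered view over per-sample fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InvoiceSample:
    sample_id: str
    anls: float                # per-sample evaluation score of the base model
    representativeness: float  # how typical the sample is of its cluster
    uniqueness: float          # how non-redundant the sample is


def select_for_finetuning(
    samples: List[InvoiceSample],
    max_anls: float = 0.6,                # hypothetical threshold
    min_representativeness: float = 0.5,  # hypothetical threshold
    min_uniqueness: float = 0.3,          # hypothetical threshold
    budget: int = 300,                    # "a few hundred" invoices
) -> List[InvoiceSample]:
    """Keep samples the model gets wrong that are typical and non-redundant."""
    candidates = [
        s for s in samples
        if s.anls <= max_anls
        and s.representativeness >= min_representativeness
        and s.uniqueness >= min_uniqueness
    ]
    # Worst failures first; break ties in favor of more representative samples.
    candidates.sort(key=lambda s: (s.anls, -s.representativeness))
    return candidates[:budget]


if __name__ == "__main__":
    pool = [
        InvoiceSample("a", anls=0.20, representativeness=0.9, uniqueness=0.8),
        InvoiceSample("b", anls=0.95, representativeness=0.9, uniqueness=0.8),
        InvoiceSample("c", anls=0.40, representativeness=0.1, uniqueness=0.9),
        InvoiceSample("d", anls=0.40, representativeness=0.7, uniqueness=0.5),
    ]
    print([s.sample_id for s in select_for_finetuning(pool)])
```

Sample "b" is dropped because the base model already handles it well, and "c" because it is an outlier; only failures that are representative of a cluster remain.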