In this hands-on workshop, you'll use
FiftyOne and the
High Quality Invoice Images for OCR dataset to run the full data-centric loop end-to-end: embed invoices with a modern visual document model, cluster them by structure, run LightOnOCR as your base model, and layer per-sample evaluation scores onto the embedding space to find *where* and *why* it fails. You'll then turn those insights into a curated fine-tuning view that combines low ANLS (Average Normalized Levenshtein Similarity) scores with representativeness and uniqueness filters, fine-tune LightOnOCR on it, and come back to FiftyOne to verify that the failure clusters actually got fixed.
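The curated view described above filters on three per-sample signals at once: a low evaluation score, high representativeness, and high uniqueness. Here is a minimal sketch of that selection logic in plain Python; the field names and thresholds are illustrative assumptions, not FiftyOne's API (in FiftyOne you would express the same idea as a dataset view over Brain-computed fields):

```python
# Sketch of the curation logic: keep samples the base model gets wrong
# (low ANLS) that are also representative of a cluster and not near-duplicates.
# Field names ("anls", "representativeness", "uniqueness") and thresholds
# are hypothetical, chosen for illustration only.

def curate(samples, anls_max=0.5, rep_min=0.7, uniq_min=0.6):
    """Select fine-tuning candidates from per-sample scores."""
    return [
        s for s in samples
        if s["anls"] < anls_max                  # base model fails here
        and s["representativeness"] >= rep_min   # typical of a real cluster
        and s["uniqueness"] >= uniq_min          # not a near-duplicate
    ]

invoices = [
    {"id": "inv-001", "anls": 0.32, "representativeness": 0.81, "uniqueness": 0.70},
    {"id": "inv-002", "anls": 0.95, "representativeness": 0.90, "uniqueness": 0.65},
    {"id": "inv-003", "anls": 0.41, "representativeness": 0.40, "uniqueness": 0.88},
    {"id": "inv-004", "anls": 0.28, "representativeness": 0.77, "uniqueness": 0.61},
]

selected = curate(invoices)
print([s["id"] for s in selected])  # → ['inv-001', 'inv-004']
```

Only the samples that fail all three gates at once make it into the fine-tuning set, which is exactly why the resulting subset can be small yet high-leverage.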
The punchline: a few hundred invoices chosen by combining embedding signals with evaluation metrics consistently beat thousands of randomly sampled ones.