Document understanding has long been one of the most challenging problems in computer vision. Traditional OCR vision systems excel at character recognition but struggle with complex layouts, tables, formulas, and structured data extraction. Enter GLM-OCR—a lightweight yet powerful multimodal model that's changing how we approach document parsing.
In this post, we'll explore what document understanding is and how to build efficient, production-ready AI document understanding pipelines using GLM-OCR integrated with
FiftyOne, a powerful dataset management and visualization platform. We'll cover everything from basic text extraction to advanced structured data parsing, with practical document understanding integration examples you can run today.
What is document understanding?
Document understanding is an AI-powered approach that goes beyond traditional OCR, enabling machines to read, interpret, and extract structured data from documents. By combining vision and language understanding, AI document understanding systems can handle the tables, formulas, and complex layouts that defeat character-only OCR vision systems, and their structured output streamlines downstream workflows.
How GLM-OCR differs from traditional OCR vision systems
GLM-OCR represents a fundamental shift in how we approach document understanding. To appreciate its value, we need to understand how it differs from traditional OCR vision systems.
Traditional OCR vision systems: Character recognition only
Classic OCR vision systems like Tesseract, PaddleOCR, or commercial solutions follow a character-by-character recognition approach:
- Image preprocessing: binarization, noise reduction
- Text detection: finding text regions
- Character recognition: identifying individual characters
- Post-processing: spell checking, language models
Limitations:
- Outputs flat text strings—no understanding of document structure.
- Struggles with complex layouts (tables become garbled text).
- Cannot handle formulas (mathematical expressions become meaningless text).
- Requires extensive post-processing to extract structured data.
- No semantic understanding—just character patterns.
GLM-OCR: Document understanding through multimodal AI
GLM-OCR takes a completely different approach. Instead of treating documents as collections of characters, it uses vision-language models to understand documents semantically:
- Visual-Language Fusion: The CogViT encoder captures both pixel-level details AND layout semantics simultaneously.
- Contextual Reasoning: Uses language understanding to interpret relationships between text blocks, tables, and figures.
- Structure-Aware Generation: Generates structured outputs (Markdown, JSON, LaTeX) that preserve document semantics.
- Multi-Token Prediction: Predicts multiple tokens per step and uses context to correct errors—like semantic proofreading.
Key differences:
- Output: flat text strings vs. structured Markdown/JSON/LaTeX.
- Tables: garbled text vs. ready-to-use HTML/Markdown tables.
- Formulas: meaningless character runs vs. ready-to-render LaTeX.
- Downstream work: regex and custom parsers vs. schema-matched structured data.
Why GLM-OCR's approach matters
Traditional OCR vision systems give you text, but you still need to parse tables manually or with complex regex, reconstruct formulas from text descriptions, build custom parsers for each document type, and handle edge cases and formatting variations.
GLM-OCR gives you semantic structure directly:
- Tables are already in HTML/Markdown format.
- Formulas are ready-to-render LaTeX.
- Structured data matches your exact schema.
- Document hierarchy is preserved.
This fundamentally changes how you build document processing pipelines—from "extract text then parse" to "extract structure directly".
Key Advantages:
- Lightweight: At 0.9B parameters, it runs efficiently on consumer hardware.
- Multimodal: Handles PDFs and images with equal proficiency.
- Structure-First: Outputs semantic Markdown, JSON, or LaTeX—not just text.
- Open Source: Apache-2.0 licensed, deployable anywhere.
- Context-Aware: Uses language understanding to improve accuracy.
The FiftyOne document understanding integration
While GLM-OCR is powerful on its own, integrating document understanding with FiftyOne unlocks several key benefits:
- Efficient Batching: Native support for batched inference, dramatically improving throughput.
- Dataset Management: Seamless integration with FiftyOne's dataset workflows.
- Visualization: Built-in tools for exploring OCR results and debugging.
- Flexibility: Easy switching between operation modes without reloading models.
Getting started with your document understanding integration
Installation
First, install the required dependencies. Since GLM-OCR is relatively new, you'll need the latest transformers from source:
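A minimal setup might look like the following; installing transformers from source is the standard way to get model support that hasn't shipped in a release yet:

```bash
# Install FiftyOne and supporting packages
pip install fiftyone torch accelerate

# GLM-OCR support requires the latest transformers from source
pip install git+https://github.com/huggingface/transformers.git
```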
For the best text viewing experience in FiftyOne, install the Caption Viewer plugin:
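FiftyOne plugins are installed via the CLI; the repository URL below is a placeholder for wherever the Caption Viewer plugin is hosted:

```bash
# Download the Caption Viewer plugin (substitute the plugin's actual repo URL)
fiftyone plugins download https://github.com/your-org/caption-viewer-plugin
```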
This plugin provides intelligent formatting for OCR outputs with proper line breaks, table rendering, and JSON pretty-printing—essential for reviewing complex document extractions.
Loading your dataset
FiftyOne makes it easy to work with document datasets. Let's start with a sample dataset from Hugging Face Hub:
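A sketch using FiftyOne's Hugging Face Hub utility; the repo id below is a placeholder, so substitute the receipts dataset you want to use:

```python
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

# Load 50 scanned receipts from the Hub; the repo id is a placeholder
dataset = fouh.load_from_hub(
    "username/scanned-receipts",  # hypothetical dataset repo
    max_samples=50,
    overwrite=True,
)
print(dataset)
```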
This loads 50 scanned receipt images—perfect for demonstrating GLM-OCR's capabilities.
Registering and loading GLM-OCR
The FiftyOne integration makes GLM-OCR available as a zoo model:
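A sketch of registering a remotely-sourced zoo model; the source URL and the zoo model name are assumptions about how the integration is published:

```python
import fiftyone.zoo as foz

# Register the remote source that provides the GLM-OCR integration
# (placeholder URL)
foz.register_zoo_model_source("https://github.com/your-org/glm-ocr-fiftyone")

# Load the model from the zoo (model name assumed)
model = foz.load_zoo_model("glm-ocr")
```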
The model automatically detects and uses the best available device (CUDA, MPS, or CPU), making it easy to get started regardless of your hardware setup.
Four operation modes for document understanding
GLM-OCR supports four distinct operation modes, each optimized for different use cases:
1. Text recognition
The most straightforward mode—extract plain text while preserving formatting:
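A minimal sketch; the `operation` keyword and zoo model name are assumptions about the integration's API:

```python
model = foz.load_zoo_model("glm-ocr", operation="text")  # names assumed
dataset.apply_model(model, label_field="text_extraction")
```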
Output: Clean text with preserved layout and structure. Perfect for general OCR tasks where you need readable text output.
2. Formula recognition
Extract mathematical formulas in LaTeX format—essential for scientific and technical documents:
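Sketched the same way, switching the assumed `operation` value:

```python
model = foz.load_zoo_model("glm-ocr", operation="formula")  # names assumed
dataset.apply_model(model, label_field="formula_extraction")
```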
Output: LaTeX-formatted mathematical expressions ready for rendering in scientific documents or notebooks.
3. Table recognition
Parse complex table structures into HTML or Markdown:
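Again as a sketch, with the assumed `operation` value set to tables:

```python
model = foz.load_zoo_model("glm-ocr", operation="table")  # names assumed
dataset.apply_model(model, label_field="table_extraction")
```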
Output: Properly structured tables in HTML or Markdown format, preserving cell relationships and hierarchies.
4. Custom structured extraction
The most powerful mode—extract structured data using custom JSON schema prompts:
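A sketch of custom extraction; the schema fields are illustrative, and the `operation`/`prompt` keywords are assumptions about the integration's API:

```python
# Illustrative JSON-schema-style prompt
prompt = """Extract the following fields and return valid JSON only:
{"vendor": "string", "date": "YYYY-MM-DD", "total": "number"}"""

model = foz.load_zoo_model("glm-ocr", operation="custom", prompt=prompt)  # names assumed
dataset.apply_model(model, label_field="structured_extraction")
```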
Output: JSON-formatted structured data matching your exact schema. This is incredibly powerful for building ETL pipelines, data warehouses, or API integrations.
Real-world document understanding example
Let's see GLM-OCR's document understanding in action with a real receipt. After running text extraction, here's what we get:
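Your exact output depends on the input image; a hypothetical receipt might come back looking like this:

```text
COFFEE & CO.
123 Main Street
01/15/2024

Espresso          2     $7.00
Croissant         1     $3.50

Subtotal               $10.50
Tax                     $0.84
Total                  $11.34
```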
The text extraction preserves the layout beautifully. But with structured extraction, we get machine-readable JSON:
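Illustrative of the shape of the output, not a real model response:

```json
{
  "vendor": "Coffee & Co.",
  "date": "2024-01-15",
  "line_items": [
    {"description": "Espresso", "quantity": 2, "amount": 7.00},
    {"description": "Croissant", "quantity": 1, "amount": 3.50}
  ],
  "subtotal": 10.50,
  "tax": 0.84,
  "total": 11.34
}
```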
This structured output can be directly ingested into databases, analytics systems, or financial software—no manual parsing required.
Visualizing results
One of FiftyOne's greatest strengths is its visualization capabilities. Launch the app to explore your results:
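Launching the App is a one-liner:

```python
session = fo.launch_app(dataset)
```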
Using the caption viewer plugin
- Click on any sample to open the modal view
- Click the + button to add a panel
- Select "Caption Viewer" from the panel list
- In the panel menu (☰), select the field you want to view:
- text_extraction for plain OCR text
- structured_extraction for JSON outputs
- Any other text field
The plugin automatically:
- Renders line breaks properly
- Converts HTML tables to markdown
- Pretty-prints JSON content
- Shows character counts
This makes it easy to spot issues, verify accuracy, and understand what the model extracted.
Performance optimization
The GLM-OCR vision system delivers impressive throughput: approximately 1.86 pages/second for PDFs and 0.67 images/second for standalone images. The FiftyOne integration's batching support significantly improves efficiency.
Batch size recommendations
Choose your batch size based on available hardware:
- CPU: batch_size=2-4 (slower but works anywhere)
- GPU (8GB): batch_size=8-16 (good balance)
- GPU (16GB+): batch_size=16-32 (maximum throughput)
Runtime configuration
One of the integration's best features is the ability to change operation modes without reloading the model:
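Sketching this, assuming the integration exposes the operation as a settable attribute (the attribute name is an assumption):

```python
# Reuse the already-loaded model in a different mode
model.operation = "table"
dataset.apply_model(model, label_field="table_extraction")

model.operation = "formula"
dataset.apply_model(model, label_field="formula_extraction")
```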
You can also adjust generation parameters such as increasing max tokens for longer outputs:
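For example (the attribute name is an assumption about the integration):

```python
# Raise the generation budget for long documents
model.max_new_tokens = 8192
```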
Real-world structured output use cases
The true power of GLM-OCR emerges when you need structured data, not just text. Here are practical use cases where structured extraction is essential, along with how to use the output downstream.
Use case 1: Financial document processing
Challenge: Extract structured financial data from invoices, receipts, and statements for accounting automation.
Why structured output is essential: Financial systems require specific fields (amounts, dates, line items) in exact formats. Flat text requires complex parsing that breaks with format variations.
Implementation:
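One way to drive the custom mode is to render a schema into a prompt. The field names below are illustrative, not mandated by GLM-OCR, and the commented FiftyOne call is a sketch with assumed keyword names:

```python
import json

# Hypothetical invoice/receipt schema; adjust fields to your accounting system
INVOICE_SCHEMA = {
    "vendor": "string",
    "date": "YYYY-MM-DD",
    "line_items": [{"description": "string", "quantity": "number", "amount": "number"}],
    "subtotal": "number",
    "tax": "number",
    "total": "number",
}

def build_extraction_prompt(schema: dict) -> str:
    """Render a JSON-schema-style prompt for the custom operation mode."""
    return (
        "Extract the following fields from this document and return "
        "valid JSON only:\n" + json.dumps(schema, indent=2)
    )

prompt = build_extraction_prompt(INVOICE_SCHEMA)
# Sketch of the FiftyOne call (kwarg names are assumptions):
# model = foz.load_zoo_model("glm-ocr", operation="custom", prompt=prompt)
# dataset.apply_model(model, label_field="invoice_data")
```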
Downstream usage:
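Because the output is already structured, converting it into ledger rows is plain JSON handling. A sketch with a hypothetical payload, including a sanity check on the total:

```python
import json

def to_ledger_entries(raw_json: str):
    """Turn structured OCR output into ledger rows, verifying the total."""
    doc = json.loads(raw_json)
    computed = round(sum(i["amount"] for i in doc["line_items"]) + doc["tax"], 2)
    if abs(computed - doc["total"]) > 0.01:
        raise ValueError(f"total mismatch: {computed} != {doc['total']}")
    return [
        {"date": doc["date"], "vendor": doc["vendor"], **item}
        for item in doc["line_items"]
    ]

# Hypothetical model output for demonstration
sample = json.dumps({
    "vendor": "ACME", "date": "2024-01-15",
    "line_items": [{"description": "Widget", "quantity": 2, "amount": 19.98}],
    "tax": 1.60, "total": 21.58,
})
entries = to_ledger_entries(sample)
```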
Benefits: No manual data entry, automatic categorization, immediate integration with accounting software.
Use case 2: Medical record digitization
Challenge: Extract structured patient information from scanned medical forms, lab reports, and prescriptions.
Why structured output is essential: Healthcare systems require structured data for EHR integration, insurance claims, and clinical decision support. Unstructured text cannot be reliably parsed.
Implementation:
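The same schema-prompt pattern applies; the medical fields below are illustrative and should match your EHR system's requirements:

```python
import json

# Illustrative medical-record schema (not a clinical standard)
MEDICAL_SCHEMA = {
    "patient_name": "string",
    "date_of_birth": "YYYY-MM-DD",
    "medications": [{"name": "string", "dosage": "string", "frequency": "string"}],
    "lab_results": [{"test": "string", "value": "number", "unit": "string"}],
}
prompt = ("Extract the following fields and return valid JSON only:\n"
          + json.dumps(MEDICAL_SCHEMA, indent=2))
# Sketch of the FiftyOne call (kwarg names are assumptions):
# model = foz.load_zoo_model("glm-ocr", operation="custom", prompt=prompt)
# dataset.apply_model(model, label_field="medical_record")
```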
Downstream usage:
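Structured lab values can feed alerting directly. A sketch with illustrative (not clinically authoritative) reference ranges:

```python
# Reference ranges here are illustrative only
REFERENCE_RANGES = {"glucose_mg_dl": (70, 110), "wbc_k_per_ul": (4.0, 11.0)}

def flag_abnormal(labs: dict) -> list:
    """Return (test, value) pairs falling outside their reference range."""
    alerts = []
    for name, value in labs.items():
        lo, hi = REFERENCE_RANGES.get(name, (float("-inf"), float("inf")))
        if not lo <= value <= hi:
            alerts.append((name, value))
    return alerts

alerts = flag_abnormal({"glucose_mg_dl": 145, "wbc_k_per_ul": 8.2})
```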
Benefits: Automated EHR population, real-time alerts for abnormal values, drug interaction checking, reduced manual data entry errors.
Use case 3: Legal document analysis
Challenge: Extract structured information from contracts, NDAs, and legal agreements for compliance monitoring and analysis.
Why structured output is essential: Legal analysis requires identifying specific clauses, dates, parties, and obligations. These must be structured for automated compliance checking and risk assessment.
Implementation:
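Again via a schema prompt; the contract fields are illustrative and should be tailored to your compliance needs:

```python
import json

# Illustrative contract schema
CONTRACT_SCHEMA = {
    "parties": ["string"],
    "effective_date": "YYYY-MM-DD",
    "expiration_date": "YYYY-MM-DD",
    "governing_law": "string",
    "obligations": [{"party": "string", "description": "string", "due_date": "YYYY-MM-DD"}],
}
prompt = ("Extract the following fields and return valid JSON only:\n"
          + json.dumps(CONTRACT_SCHEMA, indent=2))
# Sketch of the FiftyOne call (kwarg names are assumptions):
# model = foz.load_zoo_model("glm-ocr", operation="custom", prompt=prompt)
# dataset.apply_model(model, label_field="contract_data")
```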
Downstream usage:
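With expiration dates as structured fields, monitoring becomes a date comparison. A sketch over hypothetical contract records:

```python
from datetime import date

def expiring_within(contracts, today, days=30):
    """Return contracts whose expiration falls within `days` of `today`."""
    return [
        c for c in contracts
        if 0 <= (date.fromisoformat(c["expiration_date"]) - today).days <= days
    ]

contracts = [
    {"party": "ACME", "expiration_date": "2024-02-10"},
    {"party": "Globex", "expiration_date": "2024-09-01"},
]
soon = expiring_within(contracts, today=date(2024, 1, 20))
```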
Benefits: Automated compliance checking, obligation tracking, expiration monitoring, risk flagging, reduced manual review time.
Use case 4: Academic paper processing
Challenge: Extract structured metadata, formulas, and references from research papers for knowledge base construction.
Why structured output is essential: Academic databases require structured metadata (authors, affiliations, citations). Formulas must be in LaTeX for rendering. References need to be parseable for citation networks.
Implementation:
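The metadata schema below is illustrative; formula and reference extraction would use the formula mode and a references field respectively:

```python
import json

# Illustrative paper-metadata schema
PAPER_SCHEMA = {
    "title": "string",
    "authors": [{"name": "string", "affiliation": "string"}],
    "abstract": "string",
    "references": [{"title": "string", "authors": ["string"], "year": "number"}],
}
prompt = ("Extract the following fields and return valid JSON only:\n"
          + json.dumps(PAPER_SCHEMA, indent=2))
# Sketch of the FiftyOne call (kwarg names are assumptions):
# model = foz.load_zoo_model("glm-ocr", operation="custom", prompt=prompt)
# dataset.apply_model(model, label_field="paper_metadata")
```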
Downstream usage:
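Structured references make citation-network construction a simple transformation. A sketch over a hypothetical extracted record:

```python
def citation_edges(paper: dict):
    """Build (citing_title, cited_title) edges for a citation graph."""
    return [(paper["title"], ref["title"]) for ref in paper.get("references", [])]

paper = {
    "title": "A Study",
    "references": [{"title": "Prior Work A"}, {"title": "Prior Work B"}],
}
edges = citation_edges(paper)
```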
Benefits: Automated knowledge graph construction, citation network analysis, formula rendering, concept extraction, paper recommendations.
Use case 5: Supply chain document processing
Challenge: Extract structured data from shipping manifests, customs forms, and delivery receipts for logistics automation.
Why structured output is essential: Logistics systems require structured data for tracking, customs clearance, and inventory management. Each document type has specific fields that must be extracted accurately.
Implementation:
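A manifest schema sketch; field names are illustrative and should match your logistics system:

```python
import json

# Illustrative shipping-manifest schema
MANIFEST_SCHEMA = {
    "shipment_id": "string",
    "origin": "string",
    "destination": "string",
    "items": [{"sku": "string", "description": "string",
               "quantity": "number", "weight_kg": "number"}],
    "customs_value": "number",
}
prompt = ("Extract the following fields and return valid JSON only:\n"
          + json.dumps(MANIFEST_SCHEMA, indent=2))
# Sketch of the FiftyOne call (kwarg names are assumptions):
# model = foz.load_zoo_model("glm-ocr", operation="custom", prompt=prompt)
# dataset.apply_model(model, label_field="manifest_data")
```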
Downstream usage:
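Inventory updates reduce to aggregating quantities per SKU across extracted receipts. A sketch over hypothetical manifests:

```python
from collections import Counter

def inventory_delta(manifests):
    """Aggregate received quantities per SKU across delivery receipts."""
    totals = Counter()
    for m in manifests:
        for item in m["items"]:
            totals[item["sku"]] += item["quantity"]
    return dict(totals)

delta = inventory_delta([
    {"items": [{"sku": "A-1", "quantity": 5}]},
    {"items": [{"sku": "A-1", "quantity": 3}, {"sku": "B-2", "quantity": 1}]},
])
```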
Benefits: Automated customs processing, real-time tracking, inventory updates, cost calculation, delivery notifications.
Multi-mode processing for complex document understanding
Some documents require multiple extraction modes to be fully understood. Here's how to combine them:
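A sketch, assuming the integration exposes the operation as a settable attribute (the attribute name is an assumption):

```python
# Run several modes over the same dataset, storing each in its own field
for operation, field in [
    ("text", "full_text"),
    ("table", "tables"),
    ("formula", "formulas"),
]:
    model.operation = operation
    dataset.apply_model(model, label_field=field)
```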
Now each document has multiple representations, enabling different downstream uses:
- Full text: Search and retrieval
- Tables: Data analysis and visualization
- Formulas: Rendering in documentation
- Metadata: Database storage and filtering
Quality assurance workflows
Use FiftyOne's filtering and evaluation capabilities to build QA workflows:
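For example, flagging suspiciously short extractions for human review (this assumes the OCR output is stored as a string field named text_extraction):

```python
from fiftyone import ViewField as F

# Flag samples whose extraction came back empty or suspiciously short
needs_review = dataset.match(F("text_extraction").strlen() < 20)
needs_review.tag_samples("needs_review")

# Inspect the flagged samples in the App
session = fo.launch_app(needs_review)
```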
Architecture deep dive
Understanding GLM-OCR's vision system architecture helps optimize your usage:
Three-stage pipeline
- Visual ingestion: CogViT visual encoder captures pixel-level and layout information.
- Multimodal reasoning: GLM-V-based vision-language fusion aligns visual features with language understanding.
- Structured generation: Decoder with Multi-Token Prediction (MTP) generates structured output while correcting errors.
The MTP mechanism is particularly clever—it predicts multiple tokens per step and uses context to fix errors on the fly, acting like semantic proofreading rather than naive character recognition.
Batching implementation
The FiftyOne integration implements efficient batching through:
- SupportsGetItem: Custom GetItem transform loads images as PIL Images.
- TorchModelMixin: Enables DataLoader-based batching.
- Left padding: Correctly handles decoder-only model generation.
- Variable-size images: Handles different image sizes in the same batch.
This architecture allows processing multiple documents simultaneously while maintaining accuracy.
OCR vision system best practices
1. Choose the right operation mode
- Use text for general OCR and human-readable output
- Use formula for scientific documents with mathematical expressions
- Use table when you need structured table data
- Use custom for domain-specific structured extraction
2. Optimize batch sizes
Start with smaller batches and increase until you hit memory limits. Monitor GPU/CPU usage to find the sweet spot.
3. Store raw outputs
Always store the raw GLM-OCR output (Markdown/JSON) alongside processed data. This allows reprocessing as your downstream logic evolves.
4. Combine with LLMs
GLM-OCR excels at structure extraction. Feed its output into LLMs for:
- Summarization
- Question answering
- Risk analysis
- Report generation
5. Handle edge cases
For low-quality scans, consider preprocessing:
- Binarization
- De-skewing
- Noise reduction
This helps the visual encoder and improves structure detection.
The future of OCR workflows with AI-powered document understanding
GLM-OCR combined with FiftyOne provides a powerful, flexible solution for AI-powered document understanding. The integration's batching support, visualization capabilities, and multiple operation modes make it suitable for everything from quick prototypes to production pipelines.
Key takeaways:
- Fundamentally different: GLM-OCR understands document semantics, not just characters.
- Structure-first output: Get Markdown, JSON, and LaTeX directly—no parsing needed.
- Lightweight yet powerful: 0.9B parameters with SOTA accuracy (94.62 on OmniDocBench V1.5).
- Structured extraction: Extract domain-specific data via custom prompts.
- Open source: Apache-2.0 licensed, deployable anywhere.
- Developer-friendly: Easy integration with existing workflows via FiftyOne.
Whether you're building ETL pipelines, processing receipts, analyzing contracts, extracting data from scientific papers, or digitizing medical records, GLM-OCR with FiftyOne gives you the tools to build robust document understanding systems that output structured data ready for downstream processing.
GLM-OCR next steps
- Try it yourself: Run the example notebook with your own documents.
- Experiment with modes: Test different operation types on your use case.
- Build a pipeline: Integrate structured extraction into your workflow.
- Optimize: Tune batch sizes and parameters for your hardware.
- Extend: Combine with LLMs for end-to-end document understanding.
The future of AI document understanding is here—and it's more accessible than ever.
Resources