My first year teaching high school science, I had a room full of kids who'd already failed the class at least once, no textbook, and no time to teach everything. So I pulled a stack of old state exams and ran a text analysis to find what actually mattered, and almost every test came down to the same handful of core concepts. The rest was noise. I curated first, taught only that core signal, and 98 percent of my students passed.
Building a machine learning training set takes the same move. Annotation teams face the same pressure I did: tight deadlines, limited budgets, and more raw data than they can label well. Label all of it evenly and most of the budget goes to data the model already handles, while the rare cases that actually move performance go unlabeled. So you curate first, then label only the data that carries the signal. The labeling itself is the easy 10 percent. The 90 percent is choosing what to label and making sure the labels are true. Curate first, annotate second.
What data annotation is
Data annotation is the process of attaching structured labels to raw data so a machine learning model can learn the patterns those labels describe. Raw pixels carry no notion of a car, a lane line, or a defect until someone marks them. Draw a box around each product on a grocery shelf, tag it with a name, and the image becomes a training example. Compile enough of them and you have a dataset that teaches what unlabeled data cannot.
Those labels are the ground truth, and they do double duty. In training, they are what the model learns from. In evaluation, a held-out set becomes the standard its predictions are scored against. Same artifact, two jobs.
That’s why label quality sets a ceiling on everything downstream, and why a bad label costs you twice. Train on the wrong labels and the model faithfully learns the wrong thing. Evaluate against the wrong labels and you can’t even see the error, because the measure itself is wrong. A model cannot learn a distinction its labels never drew, and it cannot be measured against a standard that is wrong in the first place.
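A toy sketch makes the "costs you twice" point concrete. Everything here is hypothetical: the labels are invented, and the model is assumed to have learned exactly the same mistakes that sit in the evaluation labels.

```python
# Toy illustration only: all labels below are made up for the example.
true_labels = ["cat", "dog", "dog", "cat", "dog", "cat", "dog", "cat"]
# The annotated labels contain two mistakes (positions 1 and 5).
noisy_labels = ["cat", "cat", "dog", "cat", "dog", "dog", "dog", "cat"]

# Assumption: the model has faithfully learned the labeling mistakes.
predictions = list(noisy_labels)


def accuracy(preds, reference):
    """Fraction of predictions that match the reference labels."""
    return sum(p == r for p, r in zip(preds, reference)) / len(reference)


measured = accuracy(predictions, noisy_labels)  # what the team sees
actual = accuracy(predictions, true_labels)     # what is really happening

print(f"measured accuracy: {measured:.2f}")
print(f"actual accuracy:   {actual:.2f}")
```

Scored against the flawed labels, the model looks perfect. Scored against the truth, a quarter of its answers are wrong, and nothing in the measured number hints at it.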
Labels are the most powerful thing in the stack, and the most overlooked. They're also the part you actually own. Everyone can buy the same architecture and rent the same compute, so the model itself is becoming a commodity. The labeled data is what makes your model yours. The field has caught up to this. In our 2026 State of Visual and Physical AI report (709 practitioners), 89 percent said that as models and compute turn into commodities, data is what decides whether a project succeeds.
Neglecting that data carries a measured cost. A Google Research study found that 92 percent of practitioners hit data cascades: compounding, hard-to-trace failures that start with undervalued data quality and surface much later as a broken model.
The main types of data annotation
Data annotation comes in many forms, and the right one depends on what you need the model to learn.
The most common tasks map straight onto the label types: detection needs bounding boxes, classification needs class labels, segmentation needs masks, pose estimation needs keypoints, and tracking needs object identities that persist from frame to frame.
Classification assigns a single label to an entire image or clip, answering one question about it, such as whether a scan is normal or abnormal or a product is damaged or intact. It's the simplest type and usually the fastest to produce.
Bounding boxes draw a rectangle around each object of interest and tag it with a class, which makes them the workhorse of object detection. They're quick to draw and good enough for locating things, even though a rectangle is a loose fit for most real shapes.
Polygons trace the exact outline of an object to capture its true shape, which matters when shape carries information, such as separating one overlapping animal from another.
Segmentation masks label every pixel in an image, assigning each one a class. Semantic segmentation marks each pixel with a class such as road or sidewalk, and instance segmentation also separates each individual object. This pixel-level precision is what scene understanding, medical imaging, and many autonomous systems rely on.
Keypoints mark specific landmarks on a subject, such as the joints of a human body for pose estimation.
Polylines annotate connected linear structures, such as lane markings, road boundaries, or pipelines.
Cuboids and 3D boxes extend bounding boxes into three dimensions for LiDAR and point-cloud data, where an object has depth as well as height and width.
Object tracking carries an object's identity across the frames of a video, so the model learns motion and persistence across the whole clip.
Event annotation marks when something happens along a timeline, which is how teams label time-series and sensor streams.
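To see why a rectangle is a loose fit compared with a polygon, here is a minimal sketch that measures how much of a bounding box the object actually fills. The outline is a hypothetical triangle, and the function names are made up for this example.

```python
def polygon_area(points):
    """Area of a simple polygon using the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the outline
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bounding_box(points):
    """Tightest axis-aligned box around the outline: (x_min, y_min, x_max, y_max)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical polygon annotation: a right triangle, in pixel coordinates.
outline = [(0, 0), (4, 0), (0, 3)]

x_min, y_min, x_max, y_max = bounding_box(outline)
box_area = (x_max - x_min) * (y_max - y_min)
shape_area = polygon_area(outline)
fill_ratio = shape_area / box_area

print(f"box area {box_area}, shape area {shape_area}, fill ratio {fill_ratio:.2f}")
```

In this toy case half of the box is background. Whether that matters depends on the task: for locating an object it rarely does, and for separating overlapping shapes it often does.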
The modalities teams actually work with
The stereotype of annotation is a person drawing boxes on photographs. The reality is far more varied, and that variety is where a lot of the difficulty hides. Physical AI, the work of building systems that perceive and act in the real world, is multimodal by nature.
In the 2026 State of Visual and Physical AI survey, images still dominate at 92 percent of teams, with video close behind at 63 percent. But plenty of teams are well past photographs: 42 percent label time-series data from sensors and inertial measurement units, 38 percent work with 3D point clouds, meshes, and LiDAR, and 25 percent handle audio. And 68 percent work across three or more modalities at once. Only 6 percent work in just one.
Each modality is almost its own subject, with its own labeling conventions, its own failure modes, and its own coverage gaps. A bad label on a 2D box is one kind of problem. A misaligned LiDAR return, or a dropped reading in one of several synchronized sensor streams, is a harder problem, and easier to miss. Every modality you add multiplies the work of keeping labels correct and consistent.
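A dropped reading is easy to miss by eye and cheap to catch in code. The sketch below is one simple way to check for it, assuming each stream is a sorted list of timestamps at a known nominal period. The stream names, timestamps, and the `find_gaps` helper are all hypothetical.

```python
def find_gaps(timestamps, expected_period, tolerance=0.5):
    """Return (previous, next) timestamp pairs that are farther apart than
    expected_period * (1 + tolerance), which suggests a dropped reading."""
    limit = expected_period * (1 + tolerance)
    return [(a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > limit]


# Hypothetical streams: timestamps in milliseconds, nominal period of 100 ms.
streams = {
    "camera": [0, 100, 200, 300, 400, 500],
    "lidar": [0, 100, 200, 400, 500],  # the reading at 300 ms is missing
    "imu": [0, 100, 200, 300, 400, 500],
}

gaps = {name: find_gaps(ts, expected_period=100) for name, ts in streams.items()}

for name, found in gaps.items():
    print(name, found if found else "ok")
```

A check like this only finds missing readings. Misalignment between streams that are all present needs its own test.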
How a modern annotation workflow runs
A mature annotation workflow runs on three moves: curate, annotate, evaluate. Each one feeds the next.
Curation is deciding what to label, and it carries the most leverage. Raw data is lopsided. Some of it is near-duplicate footage the model already handles in its sleep, and some of it is the rare, weird, edge-case examples that could actually move performance. Spend your budget evenly across both and you mostly pay to reteach the model things it already knows. Find the slice that carries the signal before you spend a dollar labeling.
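One common way to thin out near-duplicates before labeling is to compare embeddings and keep only items that differ enough from what you have already kept. The sketch below is a greedy version of that idea, not a description of any particular tool. The vectors are tiny hand-written stand-ins for real embeddings, and the 0.98 threshold is an arbitrary assumption you would tune.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_near_duplicates(embeddings, threshold=0.98):
    """Greedy filter: keep an item only if its similarity to every
    already-kept item is below the threshold. Returns kept indices."""
    kept = []
    for idx, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Hypothetical embeddings; real ones would come from a vision model.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],   # nearly identical to item 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.05],   # nearly identical to item 2
    [0.5, 0.5, 0.7],     # unlike anything else
]

kept = drop_near_duplicates(embeddings)
print("send to labeling:", kept)
```

The greedy pass compares every item against everything kept so far, so it is fine for a sketch and too slow for millions of frames, where an approximate nearest-neighbor index is the usual substitute.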
Annotation is the labeling itself, plus the two things that make or break it: a clear schema (the list of classes and the rules for applying them) and guidelines clear enough that two people label the same thing the same way. In 2026 a model usually drafts the first pass while a human confirms, fixes, and reviews for the inconsistencies that quietly poison a dataset. The 2026 State of Visual and Physical AI report puts model-assisted labeling at 51 percent of teams, climbing to 70 percent among teams with models in production.
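Whether two people really do label the same thing the same way is measurable. The article does not name a metric, so take this as one reasonable choice: Cohen's kappa, a standard agreement statistic that corrects raw agreement for the agreement you would expect by chance. The annotator labels below are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Observed agreement: share of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own class frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same ten items.
annotator_a = ["damaged", "intact", "intact", "damaged", "intact",
               "intact", "damaged", "intact", "intact", "intact"]
annotator_b = ["damaged", "intact", "damaged", "damaged", "intact",
               "intact", "intact", "intact", "intact", "intact"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

Here the annotators agree on 8 of 10 items, but kappa comes out near 0.52 because most of that agreement is what two people would reach by chance on a class this common. A low score usually points at the guidelines, not the people.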
Evaluation is where you train the model, see where it fails, and trace each failure back to the data that would fix it. Those gaps become the next thing you curate, which is what closes the loop and makes the dataset sharper on every pass. A model that keeps failing the same way is pointing you upstream, to what it was trained on.
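Tracing failures back to data often starts with something as plain as an error rate per slice. This is a minimal sketch under the assumption that each evaluation record carries some metadata to slice on. The records, the `condition` field, and the function name are hypothetical.

```python
from collections import defaultdict


def error_rate_by_slice(records, key):
    """Group evaluation records by a metadata field and return the
    fraction of wrong predictions in each group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        if record["predicted"] != record["label"]:
            errors[record[key]] += 1
    return {name: errors[name] / totals[name] for name in totals}


# Hypothetical evaluation results with one metadata field per sample.
records = [
    {"label": "car", "predicted": "car", "condition": "day"},
    {"label": "car", "predicted": "car", "condition": "day"},
    {"label": "pedestrian", "predicted": "pedestrian", "condition": "day"},
    {"label": "car", "predicted": "car", "condition": "day"},
    {"label": "pedestrian", "predicted": "car", "condition": "night"},
    {"label": "car", "predicted": "car", "condition": "night"},
    {"label": "pedestrian", "predicted": "background", "condition": "night"},
    {"label": "car", "predicted": "truck", "condition": "night"},
]

rates = error_rate_by_slice(records, "condition")
worst = max(rates, key=rates.get)
print(rates, "-> curate more from:", worst)
```

The slice with the worst error rate is the first candidate for the next round of curation.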
The loop in practice: curate what to label, annotate it, then evaluate to find what's wrong and feed it back.
The catch is that the loop only works if the moves stay connected. The signal about what to label next is produced when you evaluate, and it needs to loop back into curation. Run those steps in separate tools that do not talk, and the signal leaks out in the handoff. Teams then fall back on the only move a broken loop allows: label and pray. The teams who keep the loop intact pull ahead. In the 2026 State of Visual and Physical AI report, 58 percent of self-described exceptional teams spend more than half their project time on data work, against just 21 percent of struggling teams, a nearly 3x gap.
Manual, automated, and agentic labeling
How labels get made has changed fast, and it keeps changing. The range runs from applying every label by hand to handing most of the work to a system you supervise.
Manual labeling is a human applying every label by hand. It is the most flexible, and for genuinely novel or high-stakes data, often the most reliable. It is also the slowest and the most expensive, which is why few teams still do it for everything.
Automated labeling hands the first pass to a model, usually a foundation model like Meta's Segment Anything family, so a human reviews and corrects a draft instead of starting from a blank frame. It works better than most people expect. In our auto-labeling research, models trained purely on auto-generated labels reached up to 95 percent of human-label performance on standard detection benchmarks, at up to 100,000 times lower cost. The model handles the bulk, and human attention goes to the rare and hard cases that actually need judgment.
Active learning puts the model in charge of choosing what to label. It flags the examples whose labels would teach it the most, usually the ones it is least sure about, so you spend human effort where it pays off. About 29 percent of teams we surveyed already use it.
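"Least sure" needs a definition before you can sort by it. One common choice is the entropy of the model's predicted class probabilities, used in the sketch below. This is uncertainty sampling in its simplest form, not the only active learning strategy. The item IDs and probabilities are made up.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution (natural log).
    Higher means the model is less sure."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the IDs of the `budget` items with the highest entropy."""
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:budget]]


# Hypothetical model outputs: (item ID, class probabilities).
predictions = [
    ("img_001", [0.98, 0.01, 0.01]),  # confident
    ("img_002", [0.40, 0.35, 0.25]),  # unsure
    ("img_003", [0.70, 0.20, 0.10]),  # fairly confident
    ("img_004", [0.34, 0.33, 0.33]),  # close to a coin toss
]

to_label = select_for_labeling(predictions, budget=2)
print(to_label)
```

Pure uncertainty sampling can keep picking near-identical hard cases, which is one reason teams often pair it with a diversity or deduplication step.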
Agentic labeling is the next step, and it is arriving now. Instead of a model that drafts one image at a time, an agent takes an instruction in plain language, plans the labeling across a whole dataset, pre-labels what it can, flags what it is unsure about, and routes the real judgment calls to a human.
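The routing step at the end of that description can be sketched on its own. This is not an agent, only the triage rule one might sit behind: accept confident pre-labels, send middling ones for human review, and hand the rest to a human from scratch. The thresholds, queue names, and pre-labels are all assumptions for illustration.

```python
def triage(prelabels, accept_at=0.9, review_at=0.5):
    """Split pre-labels into three queues by model confidence.
    Thresholds are illustrative and would need tuning per project."""
    queues = {"auto_accept": [], "human_review": [], "human_label": []}
    for item_id, label, confidence in prelabels:
        if confidence >= accept_at:
            queues["auto_accept"].append((item_id, label))
        elif confidence >= review_at:
            queues["human_review"].append((item_id, label))
        else:
            # Too unsure to be a useful draft: discard the proposed label.
            queues["human_label"].append((item_id, None))
    return queues


# Hypothetical pre-labels: (item ID, proposed label, confidence).
prelabels = [
    ("frame_01", "forklift", 0.97),
    ("frame_02", "pallet", 0.72),
    ("frame_03", "person", 0.31),
    ("frame_04", "forklift", 0.90),
]

queues = triage(prelabels)
for name, items in queues.items():
    print(name, items)
```

Model confidence is not always well calibrated, so it is worth spot-checking the auto-accepted queue before trusting the thresholds.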
The throughline across all four is the same. Each step moves human effort up the chain, from drawing boxes toward deciding what is worth drawing and whether the result is true. The numbers follow: 57 percent of teams expect to need less labeled data each year to hit the same performance. The job is shifting from labeling more to labeling smarter.
The shift that matters most in 2026
For years, annotation got treated as a throughput problem. Label faster, label cheaper, label more. But volume was never what was broken.
The data is brutal about it. In our 2026 State of Visual and Physical AI report, every single team, 100 percent, reported underperforming models, and the top five causes were all about data, led by too little training data for the scenarios that matter and plain bad data quality. Architecture barely registered. And the waste is staggering: more than a third say most of the data they pay to label never ships. They are buying labels by the thousand and throwing most of them away.
Labeling 3.4 million objects cost $1.18 by auto-labeling, against roughly $124,000 by hand.
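The arithmetic behind those two figures is worth doing once. The sketch below uses only the numbers quoted above. The variable names are mine.

```python
# Figures as quoted in the article.
objects = 3_400_000
auto_cost = 1.18          # dollars, auto-labeling
manual_cost = 124_000.0   # dollars, approximate cost by hand

ratio = manual_cost / auto_cost
auto_per_object = auto_cost / objects
manual_per_object = manual_cost / objects

print(f"cost ratio: about {ratio:,.0f}x")
print(f"auto-labeling: ${auto_per_object:.7f} per object")
print(f"by hand:       ${manual_per_object:.4f} per object")
```

The ratio works out to roughly 105,000 to 1, and the manual cost to between three and four cents per object.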
The report's own verdict is blunt: "The waste isn't an annotation problem. It's a data curation problem." Teams pour the budget evenly over redundant data and starve the rare cases that would actually move the model, then pay a second time to review the pile.
In 2026, data annotation comes down to this: the model learns what you choose to put in front of it, and only as well as you labeled it. The two decisions that govern everything sit upstream and downstream of the labeling itself: what you teach, and whether it is true.
To be fair, this is not a universal law. If you are early, with barely any labeled data, you are on the steep part of the curve where more labels really do help, and you should label more until you hit the bend. But most teams are well past that. They are sitting on sprawling, redundant datasets where the next label teaches the model almost nothing. Vision performance tends to rise only logarithmically with the amount of data, so each doubling of the labels buys roughly the same modest gain while costing twice as much as the last. For everyone past the early days, and that is most of us, labeling smarter beats labeling more.
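A logarithmic curve is easier to feel with numbers. The sketch below uses a made-up curve in which the score rises by a fixed amount per doubling of labels. The baseline of 40, the 2 points per doubling, and the 1,000-label starting point are all invented for illustration, not measured values.

```python
import math


def modeled_score(n_labels, base=40.0, gain_per_doubling=2.0, n0=1_000):
    """Hypothetical log-shaped curve: the score grows by a fixed
    number of points each time the label count doubles."""
    return base + gain_per_doubling * math.log2(n_labels / n0)


sizes = [1_000, 2_000, 1_000_000, 2_000_000]
scores = {n: modeled_score(n) for n in sizes}

early_gain = scores[2_000] - scores[1_000]          # from 1,000 extra labels
late_gain = scores[2_000_000] - scores[1_000_000]   # from 1,000,000 extra labels

print(f"1k -> 2k labels: +{early_gain:.2f} points")
print(f"1M -> 2M labels: +{late_gain:.2f} points")
```

Under this assumed curve, the same two points cost 1,000 labels early on and a million labels later. That is the bend in practice.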
Frequently asked questions
What is data annotation in machine learning?
Data annotation is the process of labeling raw data, such as images, video, point clouds, or sensor streams, so a machine learning model can learn from it. The labels act as ground truth: the examples the model trains on, and the standard its predictions are graded against.
What is the difference between data annotation and data labeling?
In everyday use, data annotation and data labeling mean the same thing, and most practitioners use the terms interchangeably. Where people draw a line, labeling means attaching a single class to an item, while annotation also covers richer markup like bounding boxes, segmentation masks, keypoints, and 3D cuboids.
What are the main types of data annotation?
The main types of data annotation are classification, bounding boxes, polygons, segmentation masks, keypoints, polylines, cuboids and 3D boxes, object tracking, and event annotation. The right type depends on what you need the model to learn, from a single class label on an image to pixel-level masks across synchronized sensor streams.
What is the difference between data curation and data annotation?
Data curation is deciding which data is worth labeling, while data annotation is applying the labels to the data you have chosen. Curation comes first and carries the most leverage, because labeling redundant or low-value data wastes budget no matter how accurately each item is labeled.
Is data annotation still necessary now that foundation models exist?
Yes, data annotation is still necessary. Foundation models like Meta's Segment Anything family can pre-label data automatically, but they have seen little LiDAR, sensor, or domain-specific data, so humans are still needed to review and correct their output and to choose what is worth labeling. In a 2026 survey of 709 practitioners, 95 percent said human involvement is still required throughout model development.
How much labeled data do I actually need?
Less than you might expect, if you choose well. Model accuracy tends to rise only logarithmically with dataset size, so labeling the right examples matters more than labeling the most, and a majority of teams now expect to need less labeled data each year to reach the same performance.
What is agentic labeling?
Agentic labeling is an emerging approach where an AI agent takes a plain-language instruction, plans the labeling across a dataset, pre-labels what it can, flags what it is unsure about, and routes the genuine judgment calls to a human. It extends automated and model-assisted labeling from predicting one label at a time to running the labeling workflow itself.
C. Sun, A. Shrivastava, S. Singh, A. Gupta. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. ICCV 2017: https://arxiv.org/abs/1707.02968