What SAM 3 Means for Data Annotation

Jun 29, 2026
SAM 3 segmentation masks covering every pedestrian in a crowded crosswalk, each person highlighted in a different color from a single concept prompt.
SAM 3 segments every pedestrian in the scene from a single concept prompt, with no clicking required.
SAM 3, or Segment Anything Model 3, is a Meta foundation model that detects, segments, and tracks every instance of a concept in images and video from a single text prompt. Meta released it in November 2025, with a SAM 3.1 update following in March 2026.
Segmentation has long been the slowest and most expensive part of data annotation to produce by hand, and SAM 3 makes a first pass on it close to free. That speed reshapes the economics of labeling, but it leaves the two questions that decide model quality untouched: what to label, and whether the labels are right.

Key takeaways

  • SAM 3 turns a text prompt into mask-everything segmentation. Describe a concept in plain language and SAM 3 finds, segments, and tracks every matching instance across an image or video, whereas SAM and SAM 2 segmented one object per click.
  • Pre-labeling collapses the cost of the slowest label type. In a peer-reviewed agricultural study, SAM pre-labeling cut hand-segmentation from 9.7 seconds per mask to about 2 (1.6 to 2.1 seconds), and Meta's own SAM 3 data engine more than doubled annotation throughput.
  • SAM 3 still cannot choose what to label or confirm a label is right. SAM 3 moves the bottleneck from drawing masks to curation and review, where human judgment still decides whether your trained model improves.

A quick recap: SAM and SAM 2

Meta released the original Segment Anything Model in 2023, and it made interactive segmentation feel effortless. You click an object, and SAM returns a pixel-accurate mask.
SAM 2 extended segmentation from images to video in 2024, tracking a segmented object across frames. Both models were a leap for annotation, because they turned the slow work of tracing object boundaries into a single click.
Both models shared one limit: you had to point at things. Each prompt segmented one object you indicated with a click or a box, so an image with forty cars meant forty clicks.

What’s new in SAM 3

SAM 3, or Segment Anything Model 3, is Meta's open-vocabulary segmentation model, and its defining change is promptable concept segmentation. Meta introduced it in the paper Segment Anything with Concepts, removing the one-object-per-prompt limit.
You describe a concept in plain language, like "yellow school bus," and SAM 3 segments and tracks every instance of it across an image or video at once. You can also prompt with example images instead of, or alongside, text.
This is the shift from pointing at one object to naming a category and getting all of it. SAM 3 is open-vocabulary, so it does not depend on a fixed list of classes. It also keeps the earlier modes: it still supports click-and-box visual prompting, and it can run fully automatically to propose every mask in a scene.
SAM 3 pairs a detector and a tracker that share a single vision encoder, at roughly 848 million parameters. Meta reports about a 2x gain over prior systems on concept segmentation, while preserving SAM 2's interactive performance.
The release also shipped SA-Co, a benchmark covering more than 270,000 unique concepts across images and videos, plus a companion model, SAM 3D, that reconstructs 3D objects and human bodies from a single image. SAM 3 is open source, with code and checkpoints in the facebookresearch/sam3 repository, and Meta's research write-up covers the details.
From a click to a sentence, SAM 3 now covers every way you might prompt it.

How the SAM line has changed

How the Segment Anything line evolved from SAM to SAM 3, where a single text prompt replaces one click per object.
Side by side comparison of a crowded crosswalk. The left panel, labeled "SAM 2: One click, one mask," shows a single orange umbrella segmented. The right panel, labeled "SAM 3: One prompt, every instance," shows every pedestrian segmented with a distinct colored mask.
The shift in one frame: SAM 2 returns one mask per click, while SAM 3 masks every instance of a concept from a single prompt.

The SAM 3.1 update (March 2026)

On March 27, 2026, Meta released SAM 3.1, a drop-in replacement for SAM 3. Its main addition, Object Multiplex, tracks many objects in a single forward pass and roughly doubles video throughput for real-time use.
SAM 3.1 leaves the concept-segmentation capability that matters for annotation untouched. It just makes that capability faster to run at scale.

What SAM 3 changes for annotation

For annotation teams, a large slice of labeling that used to require human clicks can now start from a text prompt. Want every pedestrian, every pallet, every solar panel masked? Describe it, and let SAM 3 produce a first pass across the whole dataset.
That accelerates a trend already underway. Foundation-model pre-labeling, where a model proposes labels for a human to review rather than starting from a blank frame, has grown more capable for years.
In the 2026 State of Visual and Physical AI Report, a survey of 709 practitioners, 51% of teams already use model-assisted labeling, rising to 70% among teams with models in production. SAM 3 makes that step faster and broader, especially for segmentation, the most time-consuming label type to produce by hand.
The clearest proof of that scale is SAM 3 itself. To build the model, Meta did not hand-label its training set.
An early version of SAM 3 proposed masks, and AI verifiers (fine-tuned Llama models) and human annotators checked them, covering more than 4 million unique concepts across roughly 5.2 million images. The AI verifiers handled most routine checks and more than doubled throughput, which freed people for the hard cases SAM 3 got wrong.
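The propose, verify, escalate loop described above can be sketched as a simple routing step. This is a minimal illustration, not Meta's implementation: the `MaskProposal` fields, the `verifier_score`, and both thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class MaskProposal:
    image_id: str
    concept: str
    # Hypothetical: an AI verifier's confidence (0 to 1) that the mask is correct.
    verifier_score: float


def route_proposals(proposals, accept_at=0.9, reject_at=0.1):
    """Split proposals into auto-accepted, auto-rejected, and human-review queues.

    The thresholds are illustrative assumptions, not values from Meta's paper.
    """
    accepted, rejected, needs_human = [], [], []
    for p in proposals:
        if p.verifier_score >= accept_at:
            accepted.append(p)
        elif p.verifier_score <= reject_at:
            rejected.append(p)
        else:
            # Ambiguous cases are the ones worth a person's time.
            needs_human.append(p)
    return accepted, rejected, needs_human


proposals = [
    MaskProposal("img_001", "yellow school bus", 0.97),
    MaskProposal("img_002", "yellow school bus", 0.55),
    MaskProposal("img_003", "yellow school bus", 0.04),
]
accepted, rejected, needs_human = route_proposals(proposals)
print(len(accepted), len(rejected), len(needs_human))
```

Moving the two thresholds closer to the extremes sends more masks to people and fewer to automatic decisions, which is the lever a team would tune against its own error tolerance.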
In a peer-reviewed study, 14 agricultural experts used an open-source tool called ARAMSAM to pre-label segmentation masks with SAM 1 and SAM 2. Manual polygon drawing took 9.7 seconds per mask.
SAM-assisted pre-labeling cut that to 1.6 to 2.1 seconds, a 4.6x to 6.1x speedup, and moved the human from drawing masks to refining model-generated ones. SAM 3 pushes that pattern further, since a single text prompt now seeds the masks.
Bar chart titled "Pre-labeling cuts time per mask." Manual polygon drawing takes 9.7 seconds per mask, SAM 1 assisted takes 2.1 seconds (4.6x faster), and SAM 2 assisted takes 1.6 seconds (6.1x faster). Source: ARAMSAM, Frontiers in AI, 2026.
In a study of 14 agricultural experts, SAM-assisted pre-labeling cut segmentation from 9.7 seconds per mask to about 2 (2.1 with SAM 1, 1.6 with SAM 2). Source: ARAMSAM, Frontiers in AI, 2026.
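The speedups in the chart follow directly from the per-mask times, and the same arithmetic gives a rough sense of what they mean for a whole dataset. The sketch below uses the study's three timings. The 100,000-mask dataset size is a hypothetical figure, and the projection assumes the per-mask times carry over to your data, which they may not.

```python
MANUAL_SECONDS = 9.7  # manual polygon drawing, per mask (ARAMSAM study)
ASSISTED_SECONDS = {"SAM 1": 2.1, "SAM 2": 1.6}  # SAM-assisted, per mask


def speedup(manual, assisted):
    """How many times faster the assisted workflow is per mask."""
    return manual / assisted


def labeling_hours(n_masks, seconds_per_mask):
    """Total labeling time in hours for a given number of masks."""
    return n_masks * seconds_per_mask / 3600


N_MASKS = 100_000  # hypothetical dataset size, not from the study

for name, seconds in ASSISTED_SECONDS.items():
    print(
        f"{name}: {speedup(MANUAL_SECONDS, seconds):.1f}x faster, "
        f"{labeling_hours(N_MASKS, seconds):.0f} h vs "
        f"{labeling_hours(N_MASKS, MANUAL_SECONDS):.0f} h manual"
    )
```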

The one thing SAM 3 does not change

A more capable pre-labeling model does not answer the two questions that actually decide whether your model improves: are you labeling the right data, and are the labels correct?
SAM 3 can mask every instance of a concept, but it cannot tell you which images were worth labeling. SAM 3 will also make mistakes a human has to catch, especially on the rare, cluttered, or domain-specific data where models are weakest.
The ARAMSAM study makes this concrete. SAM 2's automatic mask generator scored an F2 of 0.05 out of the box and reached 0.74 only after expert tuning, so the masks needed human judgment before they were usable.
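For readers unfamiliar with the metric, F2 is the F-beta score with beta set to 2, which weights recall more heavily than precision. The sketch below shows the standard formula only. The precision and recall inputs are hypothetical, and the study's own protocol for deciding when a predicted mask counts as correct is not modeled here.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical precision/recall values, for illustration only.
print(round(f_beta(0.5, 0.8), 3))           # F2 leans toward the recall value
print(round(f_beta(0.5, 0.8, beta=1.0), 3)) # F1 for comparison
```

Because F2 rewards recall, a low score such as 0.05 suggests the untuned generator was missing most of the objects the experts cared about, not just drawing a few extra masks.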
A Cornell-led analysis, The SAM2-to-SAM3 Gap, goes further. It finds that SAM 3's concept-driven design introduces new semantic failure modes, and that the prompt-engineering know-how teams built around SAM 2 does not carry over.
When a capable model pre-labels at scale, the bottleneck does not vanish; it moves to review and selection. The 2026 State of Visual and Physical AI Report makes the cost concrete: even with today's automation, 36% of teams say less than half of the data they annotate ever reaches production, because teams label too much of the wrong data.
This pattern holds across the whole pipeline. In the same report, 95% of teams say they need regular human involvement throughout model development, and even among the most mature teams in production, only 9% report minimal involvement.
Better automation redistributes that human effort toward selection and review. It does not remove it.
So treat SAM 3 as an accelerator for the labeling step, inside a workflow that still depends on human judgment for what to label and whether the labels hold up.
Meta's own data engine is the proof. Even with the best concept-segmentation model available, Meta kept its human annotators on the hard cases and the quality checks, because that judgment is what made the labels worth training on.
Foundation-model pre-labeling is table stakes now. Everyone has access to a strong segmentation model. That makes the model itself the commodity, and the real differentiator becomes who curates the right data and verifies quality fastest.
We make that case in Annotation Is a Data-Quality Problem, Not a Labeling Problem and Why Curation Beats More Labeling. For how teams store and move the masks SAM 3 produces, see Annotation Data Formats Explained.

Using SAM 3 in FiftyOne

You can run SAM 3 pre-labeling where your dataset already lives. FiftyOne integrates SAM 2 natively for prompt-based segmentation and mask propagation, and offers SAM 3 today through a community plugin in the FiftyOne Model Zoo. Our team's guide to using SAM 3 in FiftyOne walks through a first pass end to end.
The payoff is the part SAM 3 cannot do alone. FiftyOne for annotation brings curation, pre-labeling, and review into one workflow, so you choose the right data, pre-label it with SAM 3, and verify quality before training.

Frequently asked questions

What is SAM 3?

SAM 3, or Segment Anything Model 3, is a Meta foundation model released in November 2025 that detects, segments, and tracks objects in images and video. Its defining new capability is promptable concept segmentation: you describe a concept in text or with example images and get every matching instance segmented at once.

How is SAM 3 different from SAM 2?

SAM and SAM 2 segmented one object per visual prompt, such as a click or a box. SAM 3 adds open-vocabulary concept prompting, so a single text prompt like "yellow school bus" segments every instance of that concept across an image or video, while still supporting the older click-based modes.

Does SAM 3 replace human annotators?

SAM 3 does not replace human annotators. It accelerates the labeling step by pre-labeling data, but humans still decide which data is worth labeling and review and correct SAM 3's mistakes, especially on rare or domain-specific data. SAM 3 moves the bottleneck to selection and review rather than removing it.

Can I use SAM 3 for data annotation today?

You can use SAM 3 for data annotation today. It is open source and can pre-label images and video, with humans reviewing the results. The value comes from pairing it with a workflow that curates the right data first and verifies label quality afterward.

What is SAM 3.1?

SAM 3.1 is a March 2026 update to SAM 3 that introduces Object Multiplex, a shared-memory approach that tracks many objects in a single forward pass for faster real-time video segmentation. It is a drop-in replacement for SAM 3 and does not change the core promptable-concept-segmentation capability.
