**ALT Text:** "Infographic comparing foundation models and custom-trained computer vision models, featuring SAM 2 and GroundingDINO workflows for AI-assisted labeling, active learning, model training, and production deployment. The graphic highlights when foundation models outperform custom training, compares accuracy, latency, model size, and cost, and illustrates a hybrid workflow where foundation models generate labels and custom models are fine-tuned for production use."

June 8, 2026

SAM 2 and GroundingDINO in Production: When Foundation Models Beat Custom Training

Two years ago, every serious computer vision project trained a custom model. In 2026, the default has flipped — you reach for SAM 2, GroundingDINO, or a fine-tuned CLIP variant first, and only fall back to custom training when the foundation model proves insufficient.

This post lays out the decision boundary.

Where foundation models clearly win

General-purpose detection and segmentation across common object categories. SAM 2 does not take text prompts on its own; it segments from points, boxes, or masks. Paired with a text-conditioned detector such as GroundingDINO to supply the boxes, it will segment 'forklift', 'person wearing high-visibility vest', and 'pallet' in a warehouse scene at an accuracy that a custom-trained YOLOv8 needed 50,000 labels to match. GroundingDINO handles open-vocabulary detection ('find anything that looks like a leaking pipe') without any task-specific training.

Anywhere the prompt is the labeling, foundation models compress weeks of training into a single inference call.
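One common way to get text-prompted segmentation is to chain the two models: a text-conditioned detector proposes boxes, and a promptable segmenter turns each box into a mask. The sketch below shows only that wiring. `detect` and `segment` are hypothetical callables standing in for whatever GroundingDINO and SAM 2 wrappers you use; they are not real library APIs, and the 0.35 score cutoff is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Detection:
    label: str
    box: Box
    score: float


@dataclass
class LabeledRegion:
    label: str
    box: Box
    score: float
    mask: Any  # whatever the segmenter returns, e.g. a binary array


def text_to_masks(
    image: Any,
    prompts: List[str],
    detect: Callable[[Any, str], List[Detection]],  # hypothetical detector wrapper
    segment: Callable[[Any, Box], Any],             # hypothetical segmenter wrapper
    min_score: float = 0.35,                        # placeholder; tune per dataset
) -> List[LabeledRegion]:
    """Run each text prompt through the detector, then segment every kept box."""
    regions: List[LabeledRegion] = []
    for prompt in prompts:
        for det in detect(image, prompt):
            if det.score < min_score:
                continue  # drop low-confidence boxes before the segment call
            mask = segment(image, det.box)
            regions.append(LabeledRegion(det.label, det.box, det.score, mask))
    return regions
```

Passing the models in as callables keeps the wiring testable without loading any weights, and lets you swap either model without touching the loop.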

Where they're getting close

Domain-specific tasks that are visually similar to common categories. Manufacturing defect detection of cracks, scratches, and contamination is increasingly viable with prompted foundation models. Retail catalog enrichment ('extract product color, material, brand from this image') is solidly in foundation-model territory. Medical imaging is mixed — SAM 2 does well at organ segmentation but specialized models still win on subtle pathology.

Where custom training still wins

Three patterns. First, edge deployment: a 4-billion-parameter foundation model doesn't run on a manufacturing PLC, while a distilled YOLOv8 at 30M parameters runs at 60 FPS on a Jetson. Second, latency-critical inference: foundation models cost 50-200 ms per inference call, which is fine for batch work but fatal for real-time use. Third, regulated domains with narrow taxonomies: radiology models trained on millions of carefully labeled images still outperform prompted foundation models on rare findings.
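The latency point is easy to check with arithmetic. The helpers below are a minimal sketch that assumes frames are processed one at a time, with no batching or pipelining; the function names are illustrative.

```python
def frame_budget_ms(target_fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / target_fps


def max_sequential_fps(latency_ms: float) -> float:
    """Upper bound on frames per second when each call must finish before the next."""
    return 1000.0 / latency_ms
```

At 60 FPS the budget is about 16.7 ms per frame. A model that takes 50 ms per call tops out at 20 FPS under this assumption, and one that takes 200 ms at 5 FPS, so even the fast end of the quoted range is roughly three times over a 60 FPS budget.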

The hybrid pattern most teams converge on

Use a foundation model for AI-assisted labeling — let SAM 2 pre-label your dataset, then fine-tune a smaller custom model for production deployment. You get the foundation model's generalization in labeling and the custom model's speed in production. This pattern accounts for the majority of new computer vision projects in 2026.
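In practice the pre-labels are not all trusted equally. A minimal sketch of one way to route them, assuming each pre-label carries a confidence score: accept the confident ones automatically and queue the rest for human review. The `PreLabel` type and the 0.8 cutoff are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PreLabel:
    image_id: str
    label: str
    score: float  # model confidence in [0, 1]


def split_for_review(
    prelabels: Iterable[PreLabel],
    accept_at: float = 0.8,  # hypothetical cutoff; calibrate on a reviewed sample
) -> Tuple[List[PreLabel], List[PreLabel]]:
    """Return (auto_accepted, needs_review) based on a single confidence cutoff."""
    accepted: List[PreLabel] = []
    review: List[PreLabel] = []
    for item in prelabels:
        (accepted if item.score >= accept_at else review).append(item)
    # Least confident first, so reviewers see the likeliest errors early.
    review.sort(key=lambda p: p.score)
    return accepted, review
```

The accepted set plus the corrected review set becomes the training data for the smaller production model. A single global cutoff is the simplest choice; per-class cutoffs are a natural refinement when some categories are systematically harder.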

Where the platform layer matters

Operating this pattern requires hosting foundation models for inference at labeling time, then hosting custom-trained models for inference at deployment time. Most teams stitch this together manually: SAM 2 in one notebook, training in another, deployment via SageMaker. Intellabel's Growth and MLOps tiers host both: bring SAM 2 from Hugging Face for active learning, fine-tune a custom model on the labeled output, and deploy through the same pipeline.

The economics also matter. SAM 2 hosted via API at scale costs more than running a distilled custom model. The right model for labeling is rarely the right model for production. Plan for two.

The single question to ask

Before starting any new vision project, ask: can a foundation model with a clever prompt do this well enough? If the answer is yes, ship that, label whatever it gets wrong, fine-tune a smaller model on the labeled data, and deploy. The era of 'train from scratch first' is over for 60% of vision use cases. For the remaining 40%, custom training still pays off, but you should know which side you're on before you commit six months to it.
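The decision boundary in this post can be written down as a rough checklist. The helper below is one possible encoding, not a definitive rule: the `Project` fields and the `recommend` function are hypothetical, and the default 50 ms figure is simply the low end of the 50-200 ms range quoted earlier.

```python
from dataclasses import dataclass
from enum import Enum


class Approach(Enum):
    FOUNDATION_ONLY = "ship the prompted foundation model"
    HYBRID = "pre-label with a foundation model, deploy a fine-tuned small model"
    CUSTOM = "invest in custom training"


@dataclass
class Project:
    prompt_quality_ok: bool          # does a prompted foundation model do well enough?
    runs_on_edge: bool               # e.g. Jetson-class or smaller hardware
    latency_budget_ms: float         # per-inference budget
    narrow_regulated_taxonomy: bool  # e.g. rare radiology findings


def recommend(p: Project, foundation_latency_ms: float = 50.0) -> Approach:
    """Map a project's constraints to one of the three approaches."""
    # Quality gaps and narrow regulated taxonomies are where custom training still wins.
    if not p.prompt_quality_ok or p.narrow_regulated_taxonomy:
        return Approach.CUSTOM
    # Good enough in quality, but too big or too slow to deploy as-is.
    if p.runs_on_edge or p.latency_budget_ms < foundation_latency_ms:
        return Approach.HYBRID
    return Approach.FOUNDATION_ONLY
```

For example, a batch catalog-enrichment job with a 500 ms budget and acceptable prompt quality comes out as `FOUNDATION_ONLY`, while the same job with a 16.7 ms real-time budget comes out as `HYBRID`.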