
June 8, 2026
Two years ago, every serious computer vision project trained a custom model. In 2026, the default has flipped — you reach for SAM 2, GroundingDINO, or a fine-tuned CLIP variant first, and only fall back to custom training when the foundation model proves insufficient.
This post lays out the decision boundary.
Foundation models win at general-purpose detection and segmentation across common object categories. SAM 2 takes points, boxes, and masks as prompts rather than text, so in practice it is paired with a text-prompted detector such as GroundingDINO. That pairing will segment 'forklift', 'person wearing high-visibility vest', and 'pallet' in a warehouse scene with accuracy that previously took a custom-trained YOLOv8 model and 50,000 labels to match. GroundingDINO handles open-vocabulary detection ('find anything that looks like a leaking pipe') without any training.
Anywhere the prompt is the labeling, foundation models compress weeks of training into a single inference call.
The same holds for domain-specific tasks that are visually similar to common categories. Manufacturing defect detection of cracks, scratches, and contamination is increasingly viable with prompted foundation models. Retail catalog enrichment ('extract product color, material, brand from this image') is solidly in foundation-model territory. Medical imaging is mixed: SAM 2 does well at organ segmentation, but specialized models still win on subtle pathology.
Custom training still wins in three patterns. First, edge deployment: a 4-billion-parameter foundation model doesn't run on a manufacturing PLC, while a distilled YOLOv8 at 30M parameters runs at 60 FPS on a Jetson. Second, latency-critical inference: foundation models cost 50-200 ms per inference call, which is fine for batch jobs but fatal for real-time use. Third, regulated domains with narrow taxonomies: radiology models trained on millions of carefully labeled images still outperform prompted foundation models on rare findings.
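To make the latency argument concrete, here is a minimal arithmetic sketch. It assumes one sequential inference call per frame with no batching or pipelining, and the helper names `max_fps` and `fits_realtime` are made up for illustration. The 50-200 ms range and the 60 FPS target are the figures quoted above.

```python
def max_fps(latency_ms: float) -> float:
    """Upper bound on frames per second with one sequential call per frame."""
    return 1000.0 / latency_ms


def fits_realtime(latency_ms: float, target_fps: float) -> bool:
    """True if a single call fits inside the per-frame time budget."""
    return latency_ms <= 1000.0 / target_fps


# Latency range quoted above for foundation models.
for latency_ms in (50, 200):
    print(f"{latency_ms} ms/call -> at most {max_fps(latency_ms):.0f} FPS")

# A 60 FPS stream leaves roughly 16.7 ms per frame.
print(f"60 FPS budget: {1000.0 / 60:.1f} ms per frame")
print("50 ms call fits a 60 FPS budget:", fits_realtime(50, 60))
```

Even the fast end of the range tops out at 20 FPS under these assumptions, which is why the real-time case tends to land on the custom-model side.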
Use a foundation model for AI-assisted labeling — let SAM 2 pre-label your dataset, then fine-tune a smaller custom model for production deployment. You get the foundation model's generalization in labeling and the custom model's speed in production. This pattern accounts for the majority of new computer vision projects in 2026.
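One way to organize the pre-labeling step is to auto-accept confident pre-labels and queue the rest for human review. The sketch below is library-free and purely illustrative: the `PreLabel` record, the `triage` function, the 0.9 threshold, and the sample scores are all assumptions, not part of any SAM 2 or Intellabel API.

```python
from dataclasses import dataclass


@dataclass
class PreLabel:
    image_id: str
    label: str
    confidence: float  # hypothetical score in [0, 1] from the pre-labeling model


def triage(prelabels, accept_threshold=0.9):
    """Split pre-labels into auto-accepted ones and ones queued for human review."""
    accepted, review = [], []
    for p in prelabels:
        if p.confidence >= accept_threshold:
            accepted.append(p)
        else:
            review.append(p)
    return accepted, review


# Made-up sample batch, for illustration only.
batch = [
    PreLabel("img_001", "forklift", 0.97),
    PreLabel("img_002", "pallet", 0.62),
    PreLabel("img_003", "person wearing high-visibility vest", 0.91),
]

accepted, review = triage(batch)
print("auto-accepted:", [p.image_id for p in accepted])
print("needs review:", [p.image_id for p in review])
```

The accepted set plus the human-corrected review set would then become the training data for the smaller production model. The threshold is a tuning knob: lowering it saves labeling time at the cost of noisier training data.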
Operating this pattern requires hosting foundation models for inference at labeling time, then hosting custom-trained models for inference at deployment time. Most teams stitch this together manually: SAM 2 in one notebook, training in another, deployment via SageMaker. Intellabel's Growth and MLOps tiers host both: bring SAM 2 from Hugging Face for active learning, fine-tune a custom model on the labeled output, and deploy through the same pipeline.
The economics also matter. SAM 2 hosted via API at scale costs more than running a distilled custom model. The right model for labeling is rarely the right model for production. Plan for two.
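A rough break-even calculation shows how to reason about the two-model plan. Every number below is a placeholder rather than a real price: plug in your own API cost, your own serving cost for the distilled model, and the one-time cost of labeling and training. The function name `break_even_images` is likewise just for illustration.

```python
from typing import Optional


def break_even_images(fixed_cost: float,
                      api_cost_per_1k: float,
                      custom_cost_per_1k: float) -> Optional[float]:
    """Images after which the custom model is cheaper overall.

    fixed_cost: one-time cost to label data and train the custom model.
    *_per_1k: inference cost per 1,000 images for each option.
    Returns None if the custom model never becomes cheaper.
    """
    saving_per_1k = api_cost_per_1k - custom_cost_per_1k
    if saving_per_1k <= 0:
        return None
    return fixed_cost / saving_per_1k * 1000


# Placeholder figures only, not real prices.
images = break_even_images(fixed_cost=3000.0,
                           api_cost_per_1k=2.0,
                           custom_cost_per_1k=0.5)
print(f"break-even after about {images:,.0f} images")
```

With these placeholder inputs the custom model pays for itself after about 2,000,000 images. Below your own break-even volume, staying on the hosted foundation model may be the cheaper choice.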
Before starting any new vision project, ask: can a foundation model with a clever prompt do this well enough? If the answer is yes, ship that, label whatever it gets wrong, fine-tune a smaller model from the labeled data, and deploy. The era of 'train from scratch first' is over for 60% of vision use cases. Custom training still pays off for the remaining 40%, but you should know which side you're on before you commit six months to it.