Meta SAM 2 is the most impressive object segmentation model

Object segmentation with the Meta SAM 2 model

This article is part of our coverage of the latest in AI research.

Meta has just released its new Segment Anything Model 2 (SAM 2), which has not received the attention it deserves given the current focus on large language models (LLMs). SAM 2 can perform real-time image and video segmentation, and it can be applied to many domains without being fine-tuned with specific data.

And Meta has released the model weights, code, and the dataset used to train it, which will be very useful to the research and developer community. Here is how SAM 2 works and its possible impact on many industrial applications, including future generations of LLMs.

From SAM to SAM 2

Object segmentation is a complicated task that requires identifying all the pixels in an image that belong to an object. Creating object segmentation models was traditionally a very complex task that required technical expertise, a large volume of annotated data for the target application, and expensive machine learning training infrastructure. 

Meta’s Segment Anything Model (SAM), released in 2023, changed that by providing a model that could handle many use cases out of the box. SAM takes in “prompts,” which can include points, bounding boxes, or text, and detects which pixels belong to the object that corresponds to the prompt. It is the object segmentation equivalent of LLMs, which can perform many tasks without retraining.
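As a rough illustration of this prompt-based workflow, here is a minimal sketch using the image predictor from Meta’s SAM 2 repository, which on still images behaves much like the original SAM: you load an image, give the model a point or box prompt, and get back candidate masks. The checkpoint and config names are examples from the release and may differ between versions, and the image path is a placeholder.

```python
# Minimal sketch: prompt-based image segmentation with the SAM 2 image predictor.
# Checkpoint/config names and the image path are placeholders.
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2_hiera_large.pt"  # downloaded from Meta's release
model_cfg = "sam2_hiera_l.yaml"                 # config shipped with the repository

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("photo.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)
    # A single foreground click at pixel (x=500, y=375); label 1 marks foreground.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,  # return several candidate masks with confidence scores
    )
```

Each returned mask is a binary array over the image’s pixels, which is the “which pixels belong to the object” answer described above.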

SAM works by encoding the input image and the prompt, then combining the two encodings to predict a segmentation mask for each prompted object. The model was trained on SA-1B, a dataset of more than 1 billion segmentation masks annotated across roughly 11 million images. An interesting fact about the data annotation process is that the researchers used an iterative approach. They first trained an initial version of SAM on a set of annotated examples. They then used the model to help annotators speed up the annotation of the next set of examples. Finally, they used the new data to fine-tune SAM, improving its performance, and repeated the cycle at increasing speed.

SAM architecture (source: Meta blog)

SAM has already been used for various purposes, from consumer apps like Instagram to applications in science and medicine. SAM has also become an important part of the image labeling process, helping machine learning teams speed up the process of creating training examples for their specialized segmentation models. 

SAM 2 improves on SAM by adding components that make it better suited to tracking the same object across the frames of a video. The challenge with object segmentation in videos is that objects might be deformed, occluded, or shown from different angles in different frames. SAM 2 adds memory components that enable the model to maintain consistency across frames.

The memory mechanism consists of a memory encoder, a memory bank, and a memory attention module. When applied to still images, the memory components are empty and the model behaves like SAM. When the model is used on videos, the memory components store information about the target object and the user’s previous prompts. Users can add or remove prompts at different points in the video to refine the model’s output. At each frame, the model’s prediction is conditioned on the memory of the object from previous frames.
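To make this concrete, here is a minimal sketch of video segmentation with the video predictor from Meta’s SAM 2 repository, following the usage pattern it documents. It assumes the video has been extracted into a folder of JPEG frames and that a GPU is available; paths are placeholders, and exact method names may vary between releases.

```python
# Minimal sketch: video segmentation with SAM 2's memory-based video predictor.
# Paths are placeholders; assumes a folder of JPEG frames and a CUDA device.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode():
    # The inference state holds the memory bank that keeps the object
    # consistent across frames.
    state = predictor.init_state(video_path="video_frames/")

    # Prompt the target object with one foreground click on the first frame.
    predictor.add_new_points(
        state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt through the video; each frame's prediction is
    # conditioned on the memory of previous frames rather than computed in isolation.
    masks_per_frame = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks_per_frame[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```

The same state object can receive further prompts at later frames, which is how the interactive refinement described above works in practice: corrections are folded into the memory and propagated to the rest of the video.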

Meta SAM 2 architecture (source: Meta paper)

SAM 2 also comes with SA-V, a brand new dataset with a larger and richer set of training examples. SA-V contains more than 600,000 masklet annotations (spatio-temporal object masks) across around 51,000 videos. The videos feature real-world scenarios collected from 47 countries. The annotations include whole objects, object parts, and challenging scenarios such as instances where the object is partly occluded.

Like its predecessor, SA-V was annotated with the help of the model itself. The annotators used an early version of SAM 2 to annotate examples, then manually corrected the annotations and retrained the model. By repeating this process, they improved the model and increased the speed and quality of automated annotation.

According to Meta, “Annotation with our tool and SAM 2 in the loop is approximately 8.4 times faster than using SAM per frame and also significantly faster than combining SAM with an off-the-shelf tracker.”

SAM 2 in action

According to the Meta research team, SAM 2 significantly outperforms previous approaches on interactive video segmentation across 17 zero-shot video datasets while requiring approximately three times fewer human-in-the-loop interactions. SAM 2 also runs at near real-time speeds of approximately 44 frames per second.

The researchers have made the code and weights for SAM 2 available under the Apache 2.0 license, which means you can use it for commercial purposes for free. They have also released the SA-V dataset. This move is part of Meta’s recent push to open source its AI research, models, and tools, in contrast to the closed releases of companies such as OpenAI, Anthropic, and Google.

I’m excited to see how developers and researchers will repurpose this model for specialized use cases. The model is already very efficient, coming in four sizes ranging from roughly 39 million to 224 million parameters, small enough to run on many edge devices such as laptops and smartphones. But a general-purpose model will hit roadblocks in very specialized applications or on memory- and compute-constrained devices. I’m interested in seeing how SAM 2 and SA-V will help enterprises create tiny object segmentation models for specialized applications such as detecting objects on the production lines of a specific factory. It will also be very useful for the autonomous driving industry, which requires a lot of annotated data and where any percentage gain in annotation speed is a clear win.

It will also be interesting to see how models like SAM 2 can be combined with language models for more complex applications. Currently, most vision-language models (VLMs) work on raw pixel data and text. It will be worth watching what we can achieve with a VLM trained on the outputs of an object segmentation model, or on a combination of raw pixels and granular object segmentation. This could be particularly useful for robotics, where VLMs and the newer vision-language-action (VLA) models are making inroads.

As for Meta, we can expect SAM 2, Llama 3, and the next generation of AI innovation to find their way into some of the company’s most ambitious projects, including its augmented reality glasses.
