Meta SAM 2 is the most impressive object segmentation model

Object segmentation with the Meta SAM 2 model

This article is part of our coverage of the latest in AI research.

Meta has just released its new Segment Anything Model 2 (SAM 2), which has not received the attention it deserves given the current focus on large language models (LLMs). SAM 2 can perform real-time image and video segmentation, and it can be applied to many domains without being fine-tuned with specific data.

And Meta has released the model weights, code, and the dataset used to train it, which will be very useful to the research and developer community. Here is how SAM 2 works and its possible impact on many industrial applications, including future generations of LLMs.

From SAM to SAM 2

Object segmentation is a complicated task that requires identifying all the pixels in an image that belong to an object. Creating object segmentation models was traditionally a very complex task that required technical expertise, a large volume of annotated data for the target application, and expensive machine learning training infrastructure. 

Meta’s Segment Anything Model (SAM), released in 2023, changed that by providing a model that could handle many use cases out of the box. SAM takes in “prompts,” which can include points, bounding boxes, or text, and detects which pixels belong to the object that corresponds to the prompt. It is the object segmentation equivalent of LLMs, which can perform many tasks without retraining.
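As a rough illustration of this prompt-based workflow, here is a minimal sketch using the image predictor from Meta’s SAM 2 repository, which on still images behaves much like the original SAM: you load an image, give the model a point or box prompt, and get back candidate masks. The checkpoint and config names are examples from the release and may differ between versions, and the image path is a placeholder.

```python
# Minimal sketch: prompt-based image segmentation with the SAM 2 image predictor.
# Checkpoint/config names and the image path are placeholders.
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2_hiera_large.pt"  # downloaded from Meta's release
model_cfg = "sam2_hiera_l.yaml"                 # config shipped with the repository

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("photo.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)
    # A single foreground click at pixel (x=500, y=375); label 1 marks foreground.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,  # return several candidate masks with confidence scores
    )
```

Each returned mask is a binary array over the image’s pixels, which is the “which pixels belong to the object” answer described above.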

SAM works by encoding the input image and the prompt, then combining the two encodings to predict a segmentation mask for each prompted object. The model was trained on SA-1B, a dataset of more than 1 billion segmentation masks annotated across roughly 11 million images. An interesting fact about the data annotation process is that the researchers used an iterative approach. They first trained an initial version of SAM on a set of annotated examples. They then used the model to help annotators speed up the annotation of the next set of examples. Finally, they used the new data to fine-tune SAM, improving its performance, and repeated the cycle at increasing speed.

SAM architecture (source: Meta blog)

SAM has already been used for various purposes, from consumer apps like Instagram to applications in science and medicine. SAM has also become an important part of the image labeling process, helping machine learning teams speed up the process of creating training examples for their specialized segmentation models. 

SAM 2 improves on SAM by adding components that make it better suited to tracking the same object across the frames of a video. The challenge with object segmentation in videos is that objects might be deformed, occluded, or shown from different angles in different frames. SAM 2 adds memory components that enable the model to maintain consistency across frames.

The memory mechanism consists of a memory encoder, a memory bank, and a memory attention module. When applied to still images, the memory components are empty and the model behaves like SAM. When the model is used on videos, the memory components store information about the target object and the user’s previous prompts. Users can add or remove prompts at different points in the video to refine the model’s output. At each frame, the model’s prediction is conditioned on the memory of the object from previous frames.
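To make this concrete, here is a minimal sketch of video segmentation with the video predictor from Meta’s SAM 2 repository, following the usage pattern it documents. It assumes the video has been extracted into a folder of JPEG frames and that a GPU is available; paths are placeholders, and exact method names may vary between releases.

```python
# Minimal sketch: video segmentation with SAM 2's memory-based video predictor.
# Paths are placeholders; assumes a folder of JPEG frames and a CUDA device.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode():
    # The inference state holds the memory bank that keeps the object
    # consistent across frames.
    state = predictor.init_state(video_path="video_frames/")

    # Prompt the target object with one foreground click on the first frame.
    predictor.add_new_points(
        state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt through the video; each frame's prediction is
    # conditioned on the memory of previous frames rather than computed in isolation.
    masks_per_frame = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks_per_frame[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```

The same state object can receive further prompts at later frames, which is how the interactive refinement described above works in practice: corrections are folded into the memory and propagated to the rest of the video.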

Meta SAM 2 architecture (source: Meta paper)

SAM 2 also comes with SA-V, a brand new dataset with a larger and richer set of training examples. SA-V contains more than 600,000 masklet annotations (spatio-temporal object masks) across around 51,000 videos. The videos feature real-world scenarios collected from 47 countries. The annotations include whole objects, object parts, and challenging scenarios such as instances where the object is partly occluded.

Like its predecessor, SA-V was annotated with the help of the model itself. The annotators used an early version of SAM 2 to annotate examples, then manually corrected the annotations and retrained the model. By repeating this process, they improved the model and increased the speed and quality of automated annotation.

According to Meta, “Annotation with our tool and SAM 2 in the loop is approximately 8.4 times faster than using SAM per frame and also significantly faster than combining SAM with an off-the-shelf tracker.”

SAM 2 in action

According to the Meta research team, SAM 2 significantly outperforms previous approaches on interactive video segmentation across 17 zero-shot video datasets while requiring approximately three times fewer human-in-the-loop interactions. SAM 2 also runs at near real-time speeds of approximately 44 frames per second.

The researchers have made the code and weights for SAM 2 available under the Apache 2.0 license, which means you can use it for commercial purposes for free. They have also released the SA-V dataset. This move is part of Meta’s recent push to open source its AI research, models, and tools, in contrast to the closed releases of companies such as OpenAI, Anthropic, and Google.

I’m excited to see how developers and researchers will repurpose this model for specialized use cases. The model is already very efficient, coming in four sizes ranging from roughly 39 million to 224 million parameters, small enough to run on many edge devices such as laptops and smartphones. But a general-purpose model will hit roadblocks in very specialized applications or on memory- and compute-constrained devices. I’m interested in seeing how SAM 2 and SA-V will help enterprises create tiny object segmentation models for specialized applications such as detecting objects on the production lines of a specific factory. It will also be very useful for the autonomous driving industry, which requires a lot of annotated data and where any percentage gain in annotation speed is a clear win.

It will also be interesting to see how models like SAM 2 can be combined with language models for more complex applications. Currently, most vision-language models (VLMs) work on raw pixel data and text. It will be worth watching what we can achieve with a VLM trained on the outputs of an object segmentation model, or on a combination of raw pixels and granular object segmentation. This could be particularly useful for robotics, where VLMs and the newer vision-language-action (VLA) models are making inroads.

As for Meta, we can expect SAM 2, Llama 3, and the next generation of AI innovation to find their way into some of the company’s most ambitious projects, including its augmented reality glasses.
