Nvidia’s Hymba is an efficient SLM that combines state-space models and transformers

Hymba model
Image created with DALL-E 3

This article is part of our coverage of the latest in AI research.

Nvidia researchers have introduced Hymba, a family of small language models (SLMs) that achieve state-of-the-art performance by combining the strengths of transformers and state space models (SSMs). Hymba reportedly achieves high accuracy with significantly lower computational costs than traditional transformer-based LLMs.

Transformers have become the dominant architecture for LLMs. However, their computational and memory requirements grow quadratically with the length of the input sequence, making them challenging to scale and deploy in resource-constrained environments and edge devices.

State space models (SSMs) offer a more efficient alternative to transformers. SSMs scale with linear complexity, making them significantly faster and more memory-efficient for long sequences. However, SSMs are limited in recalling information and often underperform compared to transformers, especially on tasks that require fine-grained memory access.
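To make the contrast concrete, here is a minimal, illustrative state-space recurrence in PyTorch. It is not Hymba's actual SSM implementation (the paper builds on Mamba-style selective SSMs), but it shows why the approach scales linearly: each token updates a fixed-size hidden state instead of attending to everything that came before.

```python
import torch

def ssm_scan(x, A, B, C):
    """Minimal linear state-space recurrence: h_t = A @ h_{t-1} + B @ x_t, y_t = C @ h_t.

    x: (seq_len, d_in) input sequence
    A: (d_state, d_state), B: (d_state, d_in), C: (d_out, d_state)
    The state h has a fixed size, so memory stays constant and compute
    grows linearly with sequence length (unlike attention's quadratic cost).
    """
    seq_len, _ = x.shape
    h = torch.zeros(A.shape[0])
    outputs = []
    for t in range(seq_len):
        h = A @ h + B @ x[t]          # update the fixed-size hidden state
        outputs.append(C @ h)         # read out from the state
    return torch.stack(outputs)       # (seq_len, d_out)

# Example: 1,000 tokens processed with a 16-dimensional state
x = torch.randn(1000, 8)
A = torch.eye(16) * 0.9
B = torch.randn(16, 8) * 0.1
C = torch.randn(4, 16) * 0.1
print(ssm_scan(x, A, B, C).shape)     # torch.Size([1000, 4])
```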

Several studies have explored combining transformers and SSMs in hybrid models, often by stacking transformer and SSM layers. However, these hybrid approaches can introduce bottlenecks when a task is not well-suited for one of the layer types. 

Instead of stacking transformer and SSM layers, Hymba integrates attention heads and SSM heads within the same layer, creating a parallel and complementary processing pathway for the input sequence. This “hybrid-head” architecture allows Hymba to leverage the strengths of both transformer attention and SSMs.

Hymba block
The Hymba block combines transformer attention heads and SSM heads within a single layer (source: arxiv)
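
The sketch below shows, in simplified PyTorch, what a hybrid-head block might look like: attention heads and an SSM-style head process the same input in parallel, and their outputs are fused with learnable scales. The module names, the fusion scheme, and the simplified recurrent head are assumptions for illustration, not Nvidia's reference implementation.

```python
import torch
import torch.nn as nn

class HybridHeadBlock(nn.Module):
    """Illustrative hybrid-head block: attention heads and an SSM-style head
    process the same input in parallel and their outputs are fused.
    A simplified stand-in, not Hymba's actual implementation (which uses
    Mamba-style selective SSM heads and additional normalization)."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Simplified SSM head: per-channel exponential-moving-average state
        self.decay = nn.Parameter(torch.rand(d_model))     # learnable "A"
        self.in_proj = nn.Linear(d_model, d_model)         # "B"-like projection
        self.out_proj = nn.Linear(d_model, d_model)        # "C"-like readout
        # Learnable scales to fuse the two pathways
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))
        self.norm = nn.LayerNorm(d_model)

    def ssm_head(self, x):
        # Linear-time recurrence over the sequence with a fixed-size state
        batch, seq_len, d = x.shape
        h = torch.zeros(batch, d, device=x.device)
        decay = torch.sigmoid(self.decay)                  # keep decay in (0, 1)
        u = self.in_proj(x)
        outs = []
        for t in range(seq_len):
            h = decay * h + (1 - decay) * u[:, t]          # fading memory
            outs.append(self.out_proj(h))
        return torch.stack(outs, dim=1)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)                   # high-resolution recall
        ssm_out = self.ssm_head(x)                         # summarized context
        return self.norm(x + self.alpha * attn_out + self.beta * ssm_out)

block = HybridHeadBlock()
print(block(torch.randn(2, 32, 64)).shape)                 # torch.Size([2, 32, 64])
```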

The researchers compare the hybrid-head architecture to human brain functions. The attention heads, providing high-resolution recall, are like snapshot memories that store vivid recollections of specific moments. The SSM heads, summarizing context through a constant cache, function like fading memories, gradually forgetting details while retaining the core or gist of past events.

To further enhance efficiency, Hymba employs several optimizations, including a combination of local and global attention mechanisms and a cross-layer KV cache-sharing strategy. These optimizations reduce memory overhead and improve processing speed, especially for long sequences.

Full Hymba architecture
Full Hymba architecture with cache optimization (source: arxiv)
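
The rough back-of-the-envelope calculation below illustrates how these two optimizations shrink the KV cache: local attention caps the number of cached tokens per layer, and cross-layer sharing divides the number of distinct caches. The function and the example numbers are illustrative assumptions, not the paper's exact accounting or Hymba's actual configuration.

```python
def kv_cache_size_mb(n_layers, n_kv_heads, head_dim, seq_len,
                     window, global_layers, share_groups, bytes_per_val=2):
    """Rough KV-cache estimate (MB) under two Hymba-style optimizations:
    most layers use local (sliding-window) attention and only cache `window`
    tokens, and groups of consecutive layers share a single KV cache."""
    local_layers = n_layers - global_layers
    cached_tokens = global_layers * seq_len + local_layers * min(window, seq_len)
    cached_tokens /= share_groups                       # cross-layer KV sharing
    return 2 * cached_tokens * n_kv_heads * head_dim * bytes_per_val / 1e6  # 2x for K and V

# All-global attention with no sharing vs. mostly-local attention with pairwise sharing
full = kv_cache_size_mb(32, 8, 64, seq_len=8192, window=8192, global_layers=32, share_groups=1)
optimized = kv_cache_size_mb(32, 8, 64, seq_len=8192, window=1024, global_layers=3, share_groups=2)
print(f"{full:.0f} MB vs. {optimized:.0f} MB")
```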

Another innovation introduced in Hymba is the use of “learnable meta tokens.” These special tokens are prepended to input sequences and help preserve the quality of attention across sequences of different lengths. The researchers describe meta tokens as a “compressed representation of world knowledge” that improves performance across both general and recall-intensive tasks.
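
Conceptually, meta tokens can be thought of as a learnable prefix prepended to every sequence before it enters the model's blocks. The sketch below illustrates the idea in PyTorch; the class name and dimensions are assumptions for illustration rather than Hymba's exact setup.

```python
import torch
import torch.nn as nn

class MetaTokenPrepender(nn.Module):
    """Illustrative sketch of learnable meta tokens: a small set of trainable
    embeddings prepended to every input sequence so the model always has the
    same learned prefix to attend to, regardless of the prompt."""

    def __init__(self, n_meta_tokens=128, d_model=64):
        super().__init__()
        self.meta_tokens = nn.Parameter(torch.randn(n_meta_tokens, d_model) * 0.02)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, d_model)
        batch = token_embeddings.shape[0]
        meta = self.meta_tokens.unsqueeze(0).expand(batch, -1, -1)
        # The model's blocks then see [meta tokens, prompt tokens]
        return torch.cat([meta, token_embeddings], dim=1)

prepender = MetaTokenPrepender()
x = torch.randn(2, 32, 64)           # embedded prompt tokens
print(prepender(x).shape)            # torch.Size([2, 160, 64])
```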

Hymba vs other models
Hymba accuracy, cache size, and throughput vs other SLMs (source: arxiv)

The researchers trained Hymba models at several sizes, ranging from 125 million to 1.5 billion parameters, and evaluated their performance on several tasks, including commonsense reasoning, recall-intensive tasks, math problem-solving, function calling, and role-playing.

The results show that Hymba models achieve state-of-the-art performance for their size, outperforming other small LLMs and even surpassing larger models on some benchmarks. Notably, the Hymba-1.5B model, trained on 1.5 trillion tokens, achieved the best performance among all sub-2-billion-parameter LLMs, even those trained on larger corpora, and demonstrated better throughput and cache efficiency than all the transformer-based LLMs in that group.

Hymba performance
Hymba’s cache size is 19X smaller and its throughput is 2.79X higher than SLMs with similar performance (source: arxiv)

Hymba not only sets new performance benchmarks but also demonstrates significant efficiency gains. For example, on commonsense reasoning tasks, Hymba-1.5B outperformed Llama-3.2-3B by 1.32% while requiring a cache more than 11 times smaller and delivering 3.49 times higher throughput.

To further optimize Hymba for practical applications, the researchers used supervised fine-tuning and direct preference optimization. The result is the Hymba-1.5B-Instruct model, which achieves best-in-class performance on several benchmarks, including GSM8K, GPQA, and the Berkeley function-calling leaderboard, surpassing the 1-billion-parameter version of Llama-3.2.
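
For readers unfamiliar with direct preference optimization, the snippet below shows the standard DPO objective in PyTorch: the policy is trained to prefer the chosen response over the rejected one by a larger margin than a frozen reference model does. This is a generic illustration of the technique, not Nvidia's exact alignment recipe or hyperparameters.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct preference optimization loss for a batch of (chosen, rejected) pairs.

    Each argument is the summed log-probability the policy or frozen reference
    model assigns to a response. Shown for illustration only.
    """
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response more than the reference does
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()

# Toy example with random log-probabilities for 4 preference pairs
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```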

Hymba is the latest in a line of architectures that draw on state space models. It will be interesting to see whether it proves to be a practical option for on-device language models.
