Nvidia’s Hymba is an efficient SLM that combines state-space models and transformers

Hymba model
Image created with DALL-E 3

This article is part of our coverage of the latest in AI research.

Nvidia researchers have introduced Hymba, a family of small language models (SLMs) that achieve state-of-the-art performance by combining the strengths of transformers and state space models (SSMs). Hymba reportedly achieves high accuracy with significantly lower computational costs than traditional transformer-based LLMs.

Transformers have become the dominant architecture for LLMs. However, their computational and memory requirements grow quadratically with the length of the input sequence, making them challenging to scale and deploy in resource-constrained environments and edge devices.

State space models (SSMs) offer a more efficient alternative to transformers. SSMs scale with linear complexity, making them significantly faster and more memory-efficient for long sequences. However, SSMs are limited in recalling information and often underperform compared to transformers, especially on tasks that require fine-grained memory access.
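To make the contrast concrete, here is a minimal, illustrative state-space recurrence in PyTorch. It is not Hymba's actual SSM implementation (the paper builds on Mamba-style selective SSMs), but it shows why the approach scales linearly: each token updates a fixed-size hidden state instead of attending to everything that came before.

```python
import torch

def ssm_scan(x, A, B, C):
    """Minimal linear state-space recurrence: h_t = A @ h_{t-1} + B @ x_t, y_t = C @ h_t.

    x: (seq_len, d_in) input sequence
    A: (d_state, d_state), B: (d_state, d_in), C: (d_out, d_state)
    The state h has a fixed size, so memory stays constant and compute
    grows linearly with sequence length (unlike attention's quadratic cost).
    """
    seq_len, _ = x.shape
    h = torch.zeros(A.shape[0])
    outputs = []
    for t in range(seq_len):
        h = A @ h + B @ x[t]          # update the fixed-size hidden state
        outputs.append(C @ h)         # read out from the state
    return torch.stack(outputs)       # (seq_len, d_out)

# Example: 1,000 tokens processed with a 16-dimensional state
x = torch.randn(1000, 8)
A = torch.eye(16) * 0.9
B = torch.randn(16, 8) * 0.1
C = torch.randn(4, 16) * 0.1
print(ssm_scan(x, A, B, C).shape)     # torch.Size([1000, 4])
```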

Several studies have explored combining transformers and SSMs in hybrid models, often by stacking transformer and SSM layers. However, these hybrid approaches can introduce bottlenecks when a task is not well-suited for one of the layer types. 

Instead of stacking transformer and SSM layers, Hymba integrates attention heads and SSM heads within the same layer, creating a parallel and complementary processing pathway for the input sequence. This “hybrid-head” architecture allows Hymba to leverage the strengths of both transformer attention and SSMs.

Hymba block
The Hymba block combines transformer attention heads and SSM heads within a single layer (source: arxiv)
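
The sketch below shows, in simplified PyTorch, what a hybrid-head block might look like: attention heads and an SSM-style head process the same input in parallel, and their outputs are fused with learnable scales. The module names, the fusion scheme, and the simplified recurrent head are assumptions for illustration, not Nvidia's reference implementation.

```python
import torch
import torch.nn as nn

class HybridHeadBlock(nn.Module):
    """Illustrative hybrid-head block: attention heads and an SSM-style head
    process the same input in parallel and their outputs are fused.
    A simplified stand-in, not Hymba's actual implementation (which uses
    Mamba-style selective SSM heads and additional normalization)."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Simplified SSM head: per-channel exponential-moving-average state
        self.decay = nn.Parameter(torch.rand(d_model))     # learnable "A"
        self.in_proj = nn.Linear(d_model, d_model)         # "B"-like projection
        self.out_proj = nn.Linear(d_model, d_model)        # "C"-like readout
        # Learnable scales to fuse the two pathways
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))
        self.norm = nn.LayerNorm(d_model)

    def ssm_head(self, x):
        # Linear-time recurrence over the sequence with a fixed-size state
        batch, seq_len, d = x.shape
        h = torch.zeros(batch, d, device=x.device)
        decay = torch.sigmoid(self.decay)                  # keep decay in (0, 1)
        u = self.in_proj(x)
        outs = []
        for t in range(seq_len):
            h = decay * h + (1 - decay) * u[:, t]          # fading memory
            outs.append(self.out_proj(h))
        return torch.stack(outs, dim=1)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)                   # high-resolution recall
        ssm_out = self.ssm_head(x)                         # summarized context
        return self.norm(x + self.alpha * attn_out + self.beta * ssm_out)

block = HybridHeadBlock()
print(block(torch.randn(2, 32, 64)).shape)                 # torch.Size([2, 32, 64])
```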

The researchers compare the hybrid-head architecture to human brain functions. The attention heads, providing high-resolution recall, are like snapshot memories that store vivid recollections of specific moments. The SSM heads, summarizing context through a constant cache, function like fading memories, gradually forgetting details while retaining the core or gist of past events.

To further enhance efficiency, Hymba employs several optimizations, including a combination of local and global attention mechanisms and a cross-layer KV cache-sharing strategy. These optimizations reduce memory overhead and improve processing speed, especially for long sequences.

Full Hymba architecture
Full Hymba architecture with cache optimization (source: arxiv)
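
The rough back-of-the-envelope calculation below illustrates how these two optimizations shrink the KV cache: local attention caps the number of cached tokens per layer, and cross-layer sharing divides the number of distinct caches. The function and the example numbers are illustrative assumptions, not the paper's exact accounting or Hymba's actual configuration.

```python
def kv_cache_size_mb(n_layers, n_kv_heads, head_dim, seq_len,
                     window, global_layers, share_groups, bytes_per_val=2):
    """Rough KV-cache estimate (MB) under two Hymba-style optimizations:
    most layers use local (sliding-window) attention and only cache `window`
    tokens, and groups of consecutive layers share a single KV cache."""
    local_layers = n_layers - global_layers
    cached_tokens = global_layers * seq_len + local_layers * min(window, seq_len)
    cached_tokens /= share_groups                       # cross-layer KV sharing
    return 2 * cached_tokens * n_kv_heads * head_dim * bytes_per_val / 1e6  # 2x for K and V

# All-global attention with no sharing vs. mostly-local attention with pairwise sharing
full = kv_cache_size_mb(32, 8, 64, seq_len=8192, window=8192, global_layers=32, share_groups=1)
optimized = kv_cache_size_mb(32, 8, 64, seq_len=8192, window=1024, global_layers=3, share_groups=2)
print(f"{full:.0f} MB vs. {optimized:.0f} MB")
```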

Another innovation introduced in Hymba is the use of “learnable meta tokens.” These special tokens are prepended to input sequences and help preserve the quality of attention across sequences of different lengths. The researchers describe meta tokens as a “compressed representation of world knowledge” that improves performance across both general and recall-intensive tasks.
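
Conceptually, meta tokens can be thought of as a learnable prefix prepended to every sequence before it enters the model's blocks. The sketch below illustrates the idea in PyTorch; the class name and dimensions are assumptions for illustration rather than Hymba's exact setup.

```python
import torch
import torch.nn as nn

class MetaTokenPrepender(nn.Module):
    """Illustrative sketch of learnable meta tokens: a small set of trainable
    embeddings prepended to every input sequence so the model always has the
    same learned prefix to attend to, regardless of the prompt."""

    def __init__(self, n_meta_tokens=128, d_model=64):
        super().__init__()
        self.meta_tokens = nn.Parameter(torch.randn(n_meta_tokens, d_model) * 0.02)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, d_model)
        batch = token_embeddings.shape[0]
        meta = self.meta_tokens.unsqueeze(0).expand(batch, -1, -1)
        # The model's blocks then see [meta tokens, prompt tokens]
        return torch.cat([meta, token_embeddings], dim=1)

prepender = MetaTokenPrepender()
x = torch.randn(2, 32, 64)           # embedded prompt tokens
print(prepender(x).shape)            # torch.Size([2, 160, 64])
```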

Hymba vs other models
Hymba accuracy, cache size, and throughput vs other SLMs (source: arxiv)

The researchers trained Hymba models at several sizes, ranging from 125 million to 1.5 billion parameters, and evaluated their performance on several tasks, including commonsense reasoning, recall-intensive tasks, math problem-solving, function calling, and role-playing.

The results show that Hymba models achieve state-of-the-art performance for their size, outperforming other small LLMs and even surpassing larger models on some benchmarks. Notably, the Hymba-1.5B model, trained on 1.5 trillion tokens, achieved the best performance among all sub-2-billion-parameter LLMs, even those trained on larger corpora, and demonstrated better throughput and cache efficiency than all the transformer-based LLMs in that group.

Hymba performance
Hymba’s cache size is 19X smaller and its throughput is 2.79X higher than SLMs with similar performance (source: arxiv)

Hymba not only sets new performance benchmarks but also demonstrates significant efficiency gains. For example, on commonsense reasoning tasks, Hymba-1.5B outperformed Llama-3.2-3B by 1.32% while requiring a cache more than 11 times smaller and delivering 3.49 times higher throughput.

To further optimize Hymba for practical applications, the researchers used supervised fine-tuning and direct preference optimization. The result is the Hymba-1.5B-Instruct model, which achieves best-in-class performance on several benchmarks, including GSM8K, GPQA, and the Berkeley function-calling leaderboard, surpassing the 1-billion-parameter version of Llama-3.2.
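
For readers unfamiliar with direct preference optimization, the snippet below shows the standard DPO objective in PyTorch: the policy is trained to prefer the chosen response over the rejected one by a larger margin than a frozen reference model does. This is a generic illustration of the technique, not Nvidia's exact alignment recipe or hyperparameters.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct preference optimization loss for a batch of (chosen, rejected) pairs.

    Each argument is the summed log-probability the policy or frozen reference
    model assigns to a response. Shown for illustration only.
    """
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response more than the reference does
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()

# Toy example with random log-probabilities for 4 preference pairs
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```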

Hymba is the latest in a line of architectures that draw on state space models. It will be interesting to see whether it proves to be a practical option for on-device language models.
