This article is part of our coverage of the latest in AI research.
One of the key reasons for the success of the Transformer architecture, the backbone of today’s large language models (LLMs), is its “attention mechanism,” which makes it possible to scale models to many layers and parameters (and gave the original Transformer paper its title, “Attention Is All You Need”). However, a new study by researchers at the University of Maryland shows that Transformer models can shed a substantial portion of their attention layers without degrading their performance. The result is smaller, faster, and more memory-efficient models.
Redundancy in Transformers
Transformer models are composed of identically structured blocks stacked on top of each other. Each block contains an attention layer and a dense feed-forward (MLP) layer.
In their study, the researchers explored redundancy in Transformers at three levels: Transformer blocks, MLP layers, and attention layers. To measure redundancy, they used a similarity metric that compares the input and output of a module.
“The underlying hypothesis is that redundant modules produce outputs that are similar to their inputs, implying minimal transformation,” the researchers write. “In contrast, important modules are expected to significantly alter their inputs and thus should be preserved.”
Based on this metric, they pruned trained Transformer-based language models at each of the three levels.
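To make the idea concrete, here is a minimal sketch of how such an input/output similarity score could be computed. The use of cosine similarity and the tensor shapes are assumptions for illustration; the paper’s exact metric and measurement points may differ.

```python
import torch
import torch.nn.functional as F

def redundancy_score(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Average cosine similarity between a module's input and output hidden states.

    A score close to 1.0 means the module barely transforms its input,
    marking it as a candidate for removal.
    """
    # Flatten (batch, seq_len, dim) into per-token rows and average the similarity.
    sim = F.cosine_similarity(
        hidden_in.flatten(0, 1), hidden_out.flatten(0, 1), dim=-1
    )
    return sim.mean().item()
```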
They first considered dropping entire blocks deemed redundant (Block Drop). However, their experiments showed that Block Drop degrades performance significantly as it overlooks the internal fine-grained architectures within each block.
They developed two targeted techniques, MLP Drop and Attention Drop, that remove specific MLP or attention layers within blocks based on their redundancy score. Their findings show that dropping MLP layers hurts performance. But surprisingly, they were able to prune a substantial portion of attention layers without degrading the model’s performance.
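The sketch below shows one way Attention Drop could work in practice: rank attention sub-layers by their redundancy score and skip the most redundant ones while keeping the residual path intact. The simplified block structure and names here are illustrative assumptions, not the paper’s reference code.

```python
import torch
import torch.nn as nn

class SimplifiedBlock(nn.Module):
    """A simplified pre-norm Transformer block with an optional attention skip."""

    def __init__(self, attn: nn.Module, mlp: nn.Module, norm1: nn.Module, norm2: nn.Module):
        super().__init__()
        self.attn, self.mlp, self.norm1, self.norm2 = attn, mlp, norm1, norm2
        self.skip_attention = False  # set to True when the layer is deemed redundant

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.skip_attention:
            x = x + self.attn(self.norm1(x))  # attention sub-layer with residual
        x = x + self.mlp(self.norm2(x))       # MLP sub-layer with residual
        return x

def attention_drop(blocks, attn_scores: dict, n_drop: int):
    """Flag the n_drop attention sub-layers with the highest input/output similarity."""
    to_drop = sorted(attn_scores, key=attn_scores.get, reverse=True)[:n_drop]
    for idx in to_drop:
        blocks[idx].skip_attention = True
    return to_drop
```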
For instance, on the Llama-2-70B model, they were able to remove 50% of the attention layers while preserving performance at a level comparable to the full model. The Llama-2-13B and Mistral-7B models maintained over 99% of their original performance even after dropping eight attention layers.
“These results demonstrate that attention layers are highly redundant, and their removal has minimal impact on model accuracy, making Attention Drop a highly efficient pruning strategy,” the researchers write.
Another benefit of the Attention Drop pruning method is that it reduces the size of the KV cache, the memory that stores the keys and values of previous tokens so they don’t have to be recomputed during inference. For instance, in Llama-2-13B, the KV cache is reduced from 52GB to 26GB, a 50% reduction.
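The saving follows directly from the fact that each remaining attention layer keeps its own keys and values. Below is a back-of-the-envelope calculation; the batch size, sequence length, and fp16 precision are illustrative assumptions and are not taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size: keys + values for every remaining attention layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative Llama-2-13B-like settings: 40 layers, 40 KV heads, head_dim 128, fp16.
full = kv_cache_bytes(n_layers=40, n_kv_heads=40, head_dim=128, seq_len=4096, batch=16)
half = kv_cache_bytes(n_layers=20, n_kv_heads=40, head_dim=128, seq_len=4096, batch=16)
print(f"full: {full / 1e9:.1f} GB -> after dropping half the attention layers: {half / 1e9:.1f} GB")
```

Since the cache scales linearly with the number of attention layers, removing half of them halves the cache, whatever the exact batch size and context length.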
The researchers also propose “Joint Layer Drop,” a flexible approach that removes both MLP and attention layers.
“By combining the importance scores of these layers, we find that jointly dropping low-importance Attention and MLP layers yields better performance under high sparsity conditions compared to pruning only one type of layer,” the researchers write.
For example, they found that Llama-2-13B still retains 90% of the performance on the MMLU benchmark even after dropping 31 attention and MLP layers.
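One simplified reading of Joint Layer Drop is to place attention and MLP sub-layers on a single redundancy ranking and drop the least important ones regardless of type. The sketch below illustrates that idea with toy scores; whether and how the paper normalizes scores across layer types is not shown here.

```python
def joint_layer_drop(attn_scores: dict, mlp_scores: dict, n_drop: int):
    """Rank attention and MLP sub-layers on one redundancy scale and drop the top n_drop.

    Both dicts map layer index -> input/output similarity; higher similarity
    means the sub-layer changes its input less and is dropped first.
    """
    candidates = [("attn", i, s) for i, s in attn_scores.items()]
    candidates += [("mlp", i, s) for i, s in mlp_scores.items()]
    candidates.sort(key=lambda item: item[2], reverse=True)
    return candidates[:n_drop]

# Toy example: attention layers dominate the drop list, but a very
# redundant MLP layer can also be removed.
dropped = joint_layer_drop(
    attn_scores={0: 0.62, 1: 0.91, 2: 0.95, 3: 0.97},
    mlp_scores={0: 0.55, 1: 0.70, 2: 0.93, 3: 0.74},
    n_drop=3,
)
print(dropped)  # [('attn', 3, 0.97), ('attn', 2, 0.95), ('mlp', 2, 0.93)]
```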
One key insight from the paper is that the high redundancy in attention layers, particularly in deeper layers, can help guide future research on creating more compact models that have fewer layers while preserving their performance.
Model compression
This work is part of the broader effort to compress large language models. The most widely used techniques are quantization and pruning: pruning removes non-essential components of the model, while quantization converts the model’s parameters into lower-precision data types.
One of the challenges of pruning is that it can disrupt the model’s structure and make it difficult to use accelerator hardware efficiently. Block Drop and the layer-dropping methods proposed in the paper remove entire structured modules rather than fine-grained parameters, making them hardware-friendly. These techniques can also be combined with quantization methods to further improve the model’s efficiency.
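A rough calculation shows why combining the two is attractive. The parameter count, the share of weights in attention layers, and the 4-bit precision below are assumed round numbers for illustration, not figures from the paper.

```python
def model_weight_bytes(n_params: float, bits_per_param: int) -> float:
    """Approximate weight memory for a given parameter count and precision."""
    return n_params * bits_per_param / 8

params_13b = 13e9
attn_share = 0.30  # assumed rough share of parameters living in attention layers

fp16_full = model_weight_bytes(params_13b, 16)
# 4-bit quantization plus dropping half of the attention layers (assumed numbers).
pruned_params = params_13b * (1 - attn_share / 2)
int4_pruned = model_weight_bytes(pruned_params, 4)
print(f"fp16 full: {fp16_full / 1e9:.0f} GB -> int4 + attention drop: {int4_pruned / 1e9:.1f} GB")
```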