This article is part of our coverage of the latest in AI research.
One of the methods to customize large language models (LLMs) for specific applications is to provide them with detailed system prompts. As the context length of LLMs continues to grow, you can give them longer system prompts to improve their performance on your applications.
However, as system prompts become longer, they can slow down the language model and cause throughput bottlenecks. RelayAttention, a new technique developed by researchers at City University of Hong Kong and SenseTime Research, improves the efficiency of LLM services that involve long system prompts.
RelayAttention increases inference speed by removing the redundant memory access required to compute causal attention. And it does so without reducing the quality of the model’s responses.
System prompts and LLM costs

LLM services usually use an application-specific system prompt to specify how the model should handle requests. The system prompt can include several components, such as task instructions, domain-specific knowledge, and example conversations. All requests to the service share the same system prompt: when a user sends a request, it is combined with the system prompt and passed to the LLM.
Long system prompts can heavily degrade the inference throughput and latency of the service. During inference, the transformer model predicts the next token in the sequence.
At each time step, the LLM generates the next token by attending to all preceding tokens. Naturally, as the sequence becomes longer, the computation becomes more costly.
A common way to reduce the computation costs of LLM inference is to use a “key-value (KV) cache.” LLMs are autoregressive models, which means each new token depends only on its predecessors. The KV cache stores the keys and values of earlier tokens in the sequence and reuses them for later generation steps. When generating a new token, instead of recomputing the keys and values of all previous tokens, the model computes only those of the new token and appends them to the cache.
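As a rough illustration, here is a minimal single-head sketch of one KV-cached decoding step. This is illustrative PyTorch code, not taken from any particular serving framework; the function name and tensor shapes are assumptions made for this example.

```python
import torch

def decode_step(q_new, k_new, v_new, kv_cache):
    """One KV-cached decoding step for a single attention head.

    q_new, k_new, v_new: (1, head_dim) tensors for the newly generated token.
    kv_cache: a (K, V) tuple of (seq_len, head_dim) tensors, or None at the start.
    """
    if kv_cache is None:
        K, V = k_new, v_new
    else:
        # Only the new token's key/value are computed; the rest come from the cache.
        K = torch.cat([kv_cache[0], k_new], dim=0)
        V = torch.cat([kv_cache[1], v_new], dim=0)
    scale = K.shape[-1] ** -0.5
    attn = torch.softmax(q_new @ K.T * scale, dim=-1)   # (1, seq_len + 1)
    out = attn @ V                                      # (1, head_dim)
    return out, (K, V)                                  # updated cache is reused next step
```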
In LLM services that use an application-wide system prompt, another way to optimize inference is to precompute the KV cache of the system prompt once and reuse it across requests. Various techniques exploit these properties of LLM inference to reduce the computation and memory footprint of the models.
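In that spirit, a toy sketch of prefix sharing might look as follows. The projection weights `wk`/`wv`, the lengths, and the `request_kv` helper are placeholders invented for this illustration, not a real serving API.

```python
import torch

d_model, head_dim = 64, 64
wk, wv = torch.randn(d_model, head_dim), torch.randn(d_model, head_dim)  # stand-in projections

sys_hidden = torch.randn(128, d_model)            # hidden states of the system prompt
sys_K, sys_V = sys_hidden @ wk, sys_hidden @ wv   # computed once, offline, and cached

def request_kv(req_hidden):
    """Build the full KV sequence for one request by reusing the shared prefix."""
    K = torch.cat([sys_K, req_hidden @ wk], dim=0)   # only the request part is new work
    V = torch.cat([sys_V, req_hidden @ wv], dim=0)
    return K, V
```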
RelayAttention
RelayAttention proposes a new approach to improve the efficiency of LLM services that use long system prompts.
“Our key observation is that there are not only redundant memory footprint and computations corresponding to the system prompt, but also unnecessary memory accesses during causal attention computation,” the researchers write.
While all requests in the application share the system prompt, current attention algorithms such as PagedAttention and FlashAttention read its hidden states (i.e., KV values) from memory once for each request in a batch.
“To eliminate such redundant memory access, we propose RelayAttention, an exact algorithm to compute causal attention based on a mathematical reformulation of it,” the researchers write.
The key idea of RelayAttention is to group the matrix-vector multiplications corresponding to the system prompt into matrix-matrix multiplications. This allows the LLM service to load the hidden states of the system prompt from DRAM once for all request tokens in a batch.
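The effect of this grouping can be sketched in a few lines: the attention scores against the shared system keys are mathematically identical whether they are computed as separate per-request matrix-vector products or as one batched matrix-matrix product, but the batched form lets a kernel read the system KVs from memory once per batch. The snippet below is an illustrative PyTorch sketch, not the researchers’ released kernel.

```python
import torch

B, head_dim, sys_len = 32, 64, 512
queries = torch.randn(B, head_dim)       # one decoding-step query per request in the batch
sys_K = torch.randn(sys_len, head_dim)   # keys of the shared system prompt

# Per-request view: B separate matrix-vector products, each touching sys_K again.
scores_mv = torch.stack([q @ sys_K.T for q in queries])   # (B, sys_len)

# Grouped view: one matrix-matrix product over the whole batch, so a kernel
# only needs to read sys_K from DRAM once for all B requests.
scores_mm = queries @ sys_K.T                             # (B, sys_len)

assert torch.allclose(scores_mv, scores_mm, atol=1e-4)
```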
The process divides the computation of the causal attention layer into three steps: system attention, context attention, and relay fusion.
“In the system attention and context attention steps, we compute two intermediate attention outputs as if the LLM is prompted by the shared system prompt / request-specific context only,” the researchers write. “In the relay fusion step, we compute the final output as a convex combination of the two intermediate outputs.”
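Here is a minimal single-query sketch of that decomposition. The fusion weights are derived from the two segments’ softmax normalizers via the standard log-sum-exp identity; this is an illustration of the idea described in the paper, not the authors’ actual implementation, and it ignores batching, multiple heads, and kernel-level details.

```python
import torch

def relay_attention(q, sys_K, sys_V, ctx_K, ctx_V):
    """q: (head_dim,); *_K and *_V: (segment_len, head_dim). Returns (head_dim,)."""
    scale = q.shape[-1] ** -0.5
    s_sys = q @ sys_K.T * scale          # scores over the shared system prompt
    s_ctx = q @ ctx_K.T * scale          # scores over the request-specific context

    # Steps 1 and 2: attention outputs as if only one segment were present.
    out_sys = torch.softmax(s_sys, dim=-1) @ sys_V
    out_ctx = torch.softmax(s_ctx, dim=-1) @ ctx_V

    # Step 3: relay fusion -- a convex combination whose weights are the two
    # segments' renormalized softmax denominators (log-sum-exp for stability).
    lse = torch.stack([torch.logsumexp(s_sys, dim=-1),
                       torch.logsumexp(s_ctx, dim=-1)])
    alpha = torch.softmax(lse, dim=0)    # alpha[0] + alpha[1] == 1
    return alpha[0] * out_sys + alpha[1] * out_ctx

# Sanity check: the fused output equals full attention over the concatenated prompt.
head_dim = 64
q = torch.randn(head_dim)
sys_K, sys_V = torch.randn(512, head_dim), torch.randn(512, head_dim)
ctx_K, ctx_V = torch.randn(100, head_dim), torch.randn(100, head_dim)
full_scores = q @ torch.cat([sys_K, ctx_K]).T * head_dim ** -0.5
full = torch.softmax(full_scores, dim=-1) @ torch.cat([sys_V, ctx_V])
assert torch.allclose(relay_attention(q, sys_K, sys_V, ctx_K, ctx_V), full, atol=1e-4)
```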
The researchers made two main design changes to integrate RelayAttention into existing inference systems. First, instead of using a single KV cache for both the system prompt and the request-specific context, they used a separate cache to store the system KVs, filling it offline before serving the model.
“This can be viewed as a combination of prefix sharing in PagedAttention, which eliminates redundant memory footprint of system KVs, and PromptCache, which eliminates redundant computation in the prompt phase,” they write.
Second, since the system KVs are computed offline, they offset the positions of the request-specific context tokens by the length of the system prompt to ensure correct positional embeddings.
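As a toy example (the lengths here are made up), if the cached system prompt is 512 tokens long, the first request token gets position 512 rather than 0, so its positional embedding matches what processing the full concatenated prompt would have produced.

```python
import torch

sys_len, num_ctx_tokens = 512, 100   # illustrative lengths
# Without the offset, the request tokens would be numbered 0..99 and their
# positional embeddings would not match the offline-computed system KVs.
ctx_positions = sys_len + torch.arange(num_ctx_tokens)   # 512, 513, ..., 611
```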
RelayAttention in action
The researchers tested RelayAttention on the ShareGPT and MMLU datasets with three different GPUs. They used three different configurations of vLLM, a popular open-source library designed for high-throughput LLM serving.
Two of the configurations use PagedAttention and PromptCache, and the third uses RelayAttention. The researchers have released the code for using RelayAttention with vLLM.
Integrating RelayAttention into vLLM yields up to a 2.2X higher sustainable request rate and 2X higher throughput with the Llama2-7B model on a chatbot workload. The researchers observed similar efficiency improvements for several other popular LLMs, and the gains grow as system prompts get longer. Interestingly, RelayAttention maintains its throughput as the prompt continues to increase in size, which is especially useful for new models that support contexts spanning hundreds of thousands of tokens.
It is worth noting that RelayAttention is especially suitable for batched inference. The larger the batch size, the more efficient RelayAttention becomes. On the other hand, when there is only one request, such as when running an LLM on-device, RelayAttention doesn’t help. “Therefore, RelayAttention is suitable for cloud-serving scenarios,” the researchers write.
As LLMs are deployed in different environments and devices, researchers are finding new ways to make them work faster and with less memory. RelayAttention is one of several techniques that optimize LLM inference. Recently, Apple introduced “LLM in a flash,” a technique for reducing the memory footprint of LLMs on edge devices such as laptops and smartphones. Another paper by researchers at ETH Zurich suggests rearranging the transformer architecture to remove unnecessary computations and increase inference speed by up to 300X.