This article is part of our coverage of the latest in AI research.
Large language models (LLMs) like GPT-4 and Claude can learn new tasks through good prompt engineering. However, long prompts increase the cost of using these models and slow them down.
LLMLingua, a new technique developed by Microsoft, compresses prompts by eliminating irrelevant parts. Remarkably, LLMLingua can compress prompts up to 20 times without compromising the quality of the model’s response. If used well, LLMLingua can reduce the costs of using advanced LLMs and make them accessible to a broader range of users and applications.
The costs of prompt engineering
Prompt engineering is a cornerstone of leveraging LLMs for practical applications. Techniques such as chain-of-thought, in-context learning, and integrating related documents or historical conversations are instrumental in enhancing model performance for specific tasks. However, these methods typically require longer prompts, which can sometimes reach thousands of tokens. This can significantly increase the cost of using advanced models, especially expensive LLMs like GPT-4.
While there are different ways to optimize models and reduce costs, one direction of research leverages the inherent redundancy in natural language to compress the prompts. Some approaches learn specialized tokens through prompt tuning to decrease the token count required during inference.
Yet, these methods are often task-specific and may require fine-tuning the entire model, limiting their use and making them incompatible with API-based models like ChatGPT.
Other techniques use LLMs to summarize conversations to create condensed memory and knowledge representations. But these methods typically involve multiple costly invocations of LLMs.
A notable method, Selective Context, uses a smaller language model to assess the informativeness of text segments, discarding less informative content to compress prompts. Microsoft’s latest technique builds on this method and enhances it.
LLMLingua
LLMLingua is an innovative technique that compresses prompts from coarse-grained to fine-grained levels. This method is composed of several components.
The first component, the “budget controller,” dynamically distributes different compression ratios across the elements of the original prompt, such as the instruction, the demonstrations, and the question. The underlying principle here is that instructions and questions generally have a more direct impact on the generated outcomes, as they encapsulate the essential knowledge needed for the LLM to produce the answer. Conversely, when a prompt contains multiple demonstrations, the information might be repetitive. Hence, the budget controller allocates a greater budget (meaning a smaller compression ratio) to instructions and questions, while assigning a smaller budget to demonstrations.
LLMLingua uses a smaller language model such as GPT-2 or LLaMA to manage this allocation. This model calculates the perplexity of each demonstration, which serves as a gauge for the text’s relevance to the model’s response. LLMLingua then prioritizes demonstrations with the highest perplexity values, incorporating them into the prompt until the token budget for demonstrations is met. The remaining budget is allocated to refining the instruction and the question.
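To make the idea concrete, here is a minimal sketch of the budget-controller logic in Python, assuming GPT-2 as the small model. The demonstrations, the budget value, and the helper functions are illustrative and not taken from the paper.

```python
# Sketch: score each demonstration with a small model's perplexity and keep the
# highest-perplexity demonstrations until a token budget is exhausted.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the small model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

def select_demonstrations(demos: list[str], token_budget: int) -> list[str]:
    """Keep the highest-perplexity demonstrations until the budget is spent."""
    ranked = sorted(demos, key=perplexity, reverse=True)
    kept, used = [], 0
    for demo in ranked:
        n_tokens = len(tokenizer(demo).input_ids)
        if used + n_tokens > token_budget:
            break
        kept.append(demo)
        used += n_tokens
    return kept

demos = [
    "Q: 2 + 2 = ? A: 4",
    "Q: A train travels 60 km in 1.5 hours. What is its speed? A: 40 km/h",
    "Q: 7 * 8 = ? A: 56",
]
print(select_demonstrations(demos, token_budget=40))
```

In this sketch, demonstrations the small model finds most "surprising" (highest perplexity) are treated as the most informative ones to keep, mirroring the allocation principle described above.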
The second component of LLMLingua is the iterative token-level prompt compression (ITPC) algorithm, which allows for a more fine-grained compression. ITPC begins by segmenting the prompt, and then uses the small model to determine the perplexity distribution across these segments. The algorithm then constructs a compressed prompt that retains tokens with high perplexity, ensuring that key information is preserved by considering the conditional dependencies between tokens.
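The sketch below illustrates this kind of token-level filtering, again with GPT-2 as the small model. It is a simplified approximation in the spirit of ITPC, not the paper's exact algorithm: the segmentation rule, the perplexity threshold, and the function names are assumptions made for the example.

```python
# Sketch: split the prompt into segments, score each token's conditional
# negative log-likelihood given the already-kept text, and keep only tokens
# above a threshold (i.e., the "surprising", information-carrying tokens).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def compress(prompt: str, threshold: float = 3.0) -> str:
    segments = [s for s in prompt.split(". ") if s]   # naive segmentation for the sketch
    kept_text = ""
    for seg in segments:
        seg_ids = tokenizer(" " + seg, return_tensors="pt").input_ids[0]
        prefix = kept_text if kept_text else tokenizer.bos_token
        prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids[0]
        ids = torch.cat([prefix_ids, seg_ids]).unsqueeze(0)
        with torch.no_grad():
            logits = model(ids).logits[0]
        # Per-token negative log-likelihood, conditioned on everything kept so far.
        log_probs = torch.log_softmax(logits[:-1], dim=-1)
        nll = -log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
        seg_nll = nll[len(prefix_ids) - 1:]           # losses of this segment's tokens
        kept_text += tokenizer.decode(seg_ids[seg_nll > threshold])
    return kept_text.strip()

print(compress("The quick brown fox jumps over the lazy dog. It lands on the soft grass."))
```

Because each segment is scored against the text already retained, the conditional dependencies between tokens are taken into account, which is the key difference from filtering tokens independently.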
The third component involves an instruction tuning-based method that synchronizes the distribution patterns of the large and small language models. This process starts with a pre-trained small language model, which is then fine-tuned using data generated by the larger LLM. Through instruction tuning, the small model’s behavior is aligned more closely with that of the large model, enhancing the overall compression process.
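In code, this alignment step amounts to standard supervised fine-tuning of the small model on instruction-response pairs generated by the large LLM. The sketch below shows the general shape of that step with Hugging Face Transformers; the model name, the data, and the training settings are placeholders, not those used in the paper.

```python
# Sketch: fine-tune a small causal LM on data distilled from the large LLM so that
# its token distribution tracks the large model more closely.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Instruction-response pairs produced by the large LLM (placeholder rows).
pairs = [
    {"text": "Instruction: Summarize the passage in one sentence.\nResponse: ..."},
    {"text": "Instruction: Compute 12 + 7.\nResponse: 19"},
]

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]  # padding not masked, for brevity
    return enc

dataset = Dataset.from_list(pairs).map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aligned-small-lm",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()
```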
Testing LLMLingua
In their experiments, the researchers employed GPT-3.5 Turbo and Claude 1.3 as the target LLMs and used Alpaca-7B or GPT2-Alpaca as the smaller models for compression. They tested LLMLingua across various benchmarks, including GSM8K and BBH for reasoning and in-context learning, as well as ShareGPT and Arxiv-March23 for conversational context understanding and summarization tasks, respectively.
“Our proposed method consistently outperforms the prior methods by a large margin in almost all experiments,” the researchers report.
In the reasoning and in-context learning benchmarks GSM8K and BBH, LLMLingua not only achieved higher accuracy than the full-shot approach but also attained compression ratios of 5x and 3x, respectively.
“This well demonstrates that our compressed prompts effectively retain the reasoning information contained in the original prompt,” the researchers write.
For the contextual understanding benchmarks on ShareGPT and Arxiv-March23, LLMLingua compressed prompts by 9x and 3.3x. This indicates that LLMLingua preserves the semantic integrity of the initial prompts while compressing them. Moreover, LLMLingua surpassed other prompt compression methods in both accuracy and the degree of compression. In some cases, it achieved up to 20x compression on the original prompt.
Despite involving multiple steps and two models, LLMLingua achieved an end-to-end speedup of 1.7x to 5.7x with minimal computational overhead.
“Our approach holds substantial practical implications, as it not only reduces computational costs but also offers a potential solution for accommodating longer contexts in LLMs,” the researchers assert.
To enable wider adoption, Microsoft has made LLMLingua accessible through an easy-to-use open-source library. Developers can use this library to integrate LLMLingua into their own applications.
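The snippet below sketches how the library can be used, based on its published Python package (installable with pip as llmlingua). The exact arguments and defaults may differ between versions, so check the repository for the current API; the demonstrations, instruction, and question shown here are placeholders.

```python
# Sketch: compress a prompt with the open-source llmlingua package.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a small language model for compression by default

result = compressor.compress_prompt(
    ["<demonstration 1>", "<demonstration 2>"],        # in-context demonstrations
    instruction="Answer the question step by step.",
    question="What is 12 * 7?",
    target_token=200,                                  # desired size of the compressed prompt
)
print(result["compressed_prompt"])
```

The compressed prompt returned by the library can then be sent to an expensive model such as GPT-4, which is where the cost savings come from.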