This article is part of our coverage of the latest in AI research.
Large language models (LLMs) like GPT-4 and Claude can learn new tasks through good prompt engineering. However, long prompts increase the cost of using these models and slow them down.
LLMLingua, a new technique developed by Microsoft, compresses prompts by eliminating irrelevant parts. Remarkably, LLMLingua can compress prompts up to 20 times without compromising the quality of the model’s response. If used well, LLMLingua can reduce the costs of using advanced LLMs and make them accessible to a broader range of users and applications.
The costs of prompt engineering
Prompt engineering is a cornerstone of leveraging LLMs for practical applications. Techniques such as chain-of-thought, in-context learning, and integrating related documents or historical conversations are instrumental in enhancing model performance for specific tasks. However, these methods typically require longer prompts, which can sometimes reach thousands of tokens. This can significantly increase the cost of using advanced models, especially expensive LLMs like GPT-4.
While there are different ways to optimize models and reduce costs, one direction of research leverages the inherent redundancy in natural language to compress the prompts. Some approaches learn specialized tokens through prompt tuning to decrease the token count required during inference.
Yet, these methods are often task-specific and may require fine-tuning the entire model, limiting their use and making them incompatible with API-based models like ChatGPT.
Other techniques use LLMs to summarize conversations to create condensed memory and knowledge representations. But these methods typically involve multiple costly invocations of LLMs.
A notable method, Selective Context, uses a smaller language model to assess the informativeness of text segments, discarding less informative content to compress prompts. Microsoft’s latest technique builds on this method and enhances it.
LLMLingua
LLMLingua is an innovative technique that compresses prompts from coarse-grained to fine-grained levels. This method is composed of several components.
The first component, the “budget controller,” dynamically distributes different compression ratios across the elements of the original prompt, such as the instruction, the demonstrations, and the question. The underlying principle here is that instructions and questions generally have a more direct impact on the generated outcomes, as they encapsulate the essential knowledge needed for the LLM to produce the answer. Conversely, when a prompt contains multiple demonstrations, the information might be repetitive. Hence, the budget controller allocates a greater budget (meaning a smaller compression ratio) to instructions and questions, while assigning a smaller budget to demonstrations.
LLMLingua uses a smaller language model such as GPT-2 or LLaMA to manage this allocation. This model calculates the perplexity of each demonstration, which serves as a gauge for the text’s relevance to the model’s response. LLMLingua then prioritizes demonstrations with the highest perplexity values, incorporating them into the prompt until the token budget for demonstrations is met. The remaining budget is allocated to refining the instruction and the question.
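To make the idea concrete, here is a minimal sketch of the budget-controller logic in Python, assuming GPT-2 as the small model. The demonstrations, the budget value, and the helper functions are illustrative and not taken from the paper.

```python
# Sketch: score each demonstration with a small model's perplexity and keep the
# highest-perplexity demonstrations until a token budget is exhausted.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the small model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

def select_demonstrations(demos: list[str], token_budget: int) -> list[str]:
    """Keep the highest-perplexity demonstrations until the budget is spent."""
    ranked = sorted(demos, key=perplexity, reverse=True)
    kept, used = [], 0
    for demo in ranked:
        n_tokens = len(tokenizer(demo).input_ids)
        if used + n_tokens > token_budget:
            break
        kept.append(demo)
        used += n_tokens
    return kept

demos = [
    "Q: 2 + 2 = ? A: 4",
    "Q: A train travels 60 km in 1.5 hours. What is its speed? A: 40 km/h",
    "Q: 7 * 8 = ? A: 56",
]
print(select_demonstrations(demos, token_budget=40))
```

In this sketch, demonstrations the small model finds most "surprising" (highest perplexity) are treated as the most informative ones to keep, mirroring the allocation principle described above.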
The second component of LLMLingua is the iterative token-level prompt compression (ITPC) algorithm, which allows for a more fine-grained compression. ITPC begins by segmenting the prompt, and then uses the small model to determine the perplexity distribution across these segments. The algorithm then constructs a compressed prompt that retains tokens with high perplexity, ensuring that key information is preserved by considering the conditional dependencies between tokens.
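The sketch below illustrates this kind of token-level filtering, again with GPT-2 as the small model. It is a simplified approximation in the spirit of ITPC, not the paper's exact algorithm: the segmentation rule, the perplexity threshold, and the function names are assumptions made for the example.

```python
# Sketch: split the prompt into segments, score each token's conditional
# negative log-likelihood given the already-kept text, and keep only tokens
# above a threshold (i.e., the "surprising", information-carrying tokens).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def compress(prompt: str, threshold: float = 3.0) -> str:
    segments = [s for s in prompt.split(". ") if s]   # naive segmentation for the sketch
    kept_text = ""
    for seg in segments:
        seg_ids = tokenizer(" " + seg, return_tensors="pt").input_ids[0]
        prefix = kept_text if kept_text else tokenizer.bos_token
        prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids[0]
        ids = torch.cat([prefix_ids, seg_ids]).unsqueeze(0)
        with torch.no_grad():
            logits = model(ids).logits[0]
        # Per-token negative log-likelihood, conditioned on everything kept so far.
        log_probs = torch.log_softmax(logits[:-1], dim=-1)
        nll = -log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
        seg_nll = nll[len(prefix_ids) - 1:]           # losses of this segment's tokens
        kept_text += tokenizer.decode(seg_ids[seg_nll > threshold])
    return kept_text.strip()

print(compress("The quick brown fox jumps over the lazy dog. It lands on the soft grass."))
```

Because each segment is scored against the text already retained, the conditional dependencies between tokens are taken into account, which is the key difference from filtering tokens independently.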
The third component involves an instruction tuning-based method that synchronizes the distribution patterns of the large and small language models. This process starts with a pre-trained small language model, which is then fine-tuned using data generated by the larger LLM. Through instruction tuning, the small model’s behavior is aligned more closely with that of the large model, enhancing the overall compression process.
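In code, this alignment step amounts to standard supervised fine-tuning of the small model on instruction-response pairs generated by the large LLM. The sketch below shows the general shape of that step with Hugging Face Transformers; the model name, the data, and the training settings are placeholders, not those used in the paper.

```python
# Sketch: fine-tune a small causal LM on data distilled from the large LLM so that
# its token distribution tracks the large model more closely.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Instruction-response pairs produced by the large LLM (placeholder rows).
pairs = [
    {"text": "Instruction: Summarize the passage in one sentence.\nResponse: ..."},
    {"text": "Instruction: Compute 12 + 7.\nResponse: 19"},
]

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]  # padding not masked, for brevity
    return enc

dataset = Dataset.from_list(pairs).map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aligned-small-lm",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()
```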
Testing LLMLingua
In their experiments, the researchers employed GPT-3.5 Turbo and Claude 1.3 as the target LLMs and used Alpaca-7B or GPT2-Alpaca as the smaller models for compression. They tested LLMLingua across various benchmarks, including GSM8K and BBH for reasoning and in-context learning, as well as ShareGPT and Arxiv-March23 for conversational context understanding and summarization tasks, respectively.
“Our proposed method consistently outperforms the prior methods by a large margin in almost all experiments,” the researchers report.
In the reasoning and in-context learning benchmarks GSM8K and BBH, LLMLingua not only achieved higher accuracy than the full-shot approach but also attained compression ratios of 5x and 3x, respectively.
“This well demonstrates that our compressed prompts effectively retain the reasoning information contained in the original prompt,” the researchers write.
For the contextual understanding benchmarks on ShareGPT and Arxiv-March23, LLMLingua compressed prompts by 9x and 3.3x. This indicates that LLMLingua preserves the semantic integrity of the initial prompts while compressing them. Moreover, LLMLingua surpassed other prompt compression methods in both accuracy and the degree of compression. In some cases, it achieved up to 20x compression on the original prompt.
Despite involving multiple steps and two models, LLMLingua achieved an end-to-end speedup of 1.7x to 5.7x with minimal computational overhead.
“Our approach holds substantial practical implications, as it not only reduces computational costs but also offers a potential solution for accommodating longer contexts in LLMs,” the researchers assert.
To enable wider adoption, Microsoft has made LLMLingua accessible through an easy-to-use open-source library. Developers can use this library to integrate LLMLingua into their own applications.
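The snippet below sketches how the library can be used, based on its published Python package (installable with pip as llmlingua). The exact arguments and defaults may differ between versions, so check the repository for the current API; the demonstrations, instruction, and question shown here are placeholders.

```python
# Sketch: compress a prompt with the open-source llmlingua package.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a small language model for compression by default

result = compressor.compress_prompt(
    ["<demonstration 1>", "<demonstration 2>"],        # in-context demonstrations
    instruction="Answer the question step by step.",
    question="What is 12 * 7?",
    target_token=200,                                  # desired size of the compressed prompt
)
print(result["compressed_prompt"])
```

The compressed prompt returned by the library can then be sent to an expensive model such as GPT-4, which is where the cost savings come from.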