This article is part of our coverage of the latest in AI research.
Large language models (LLMs) with very long context windows make it easier to create advanced AI applications with simple prompting techniques, without the need for complex tools and pipelines. However, evaluating the performance of long-context LLMs is still an underexplored area that needs more investigation.
In a new paper, Google DeepMind has introduced a benchmark called Long-Context Frontiers (LOFT) to rigorously evaluate the performance of long-context language models (LCLMs). LOFT is designed for tasks with very long prompts.
LOFT can be a great tool to evaluate and compare LLMs as context windows expand to millions of tokens.
Long-context language models
The limited context window of LLMs has traditionally required specialized techniques to customize the models for new tasks. For example, if a model cannot perform a task through few-shot learning, you need to fine-tune it. And if you want to add proprietary information to the prompt, you need a retrieval-augmented generation (RAG) pipeline to choose the specific bits of information from your corpus that are relevant to the task.
With long-context language models, you can just insert your entire corpus or training examples into the prompt and have the model learn the task or choose the parts it needs to solve the problem. You can further increase the capabilities of the model using techniques such as adding instructions and chain-of-thought reasoning.
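To make the contrast concrete, here is a minimal sketch, assuming a toy corpus of text passages. The helper functions and the scoring heuristic are invented for illustration and are not part of DeepMind's paper; they only show the difference between retrieving a small slice of the corpus and placing the entire corpus in the prompt.

```python
# Minimal sketch (not from the paper): RAG-style prompting vs. long-context prompting.

def score(passage: str, question: str) -> int:
    """Toy relevance score: count how many question words appear in the passage."""
    q_words = set(question.lower().split())
    return sum(1 for w in passage.lower().split() if w in q_words)

def rag_prompt(corpus: list[str], question: str, top_k: int = 5) -> str:
    """RAG pipeline: select only a small, question-specific slice of the corpus."""
    top = sorted(corpus, key=lambda p: score(p, question), reverse=True)[:top_k]
    return "Context:\n" + "\n\n".join(top) + f"\n\nQuestion: {question}\nAnswer:"

def long_context_prompt(corpus: list[str], question: str) -> str:
    """Long-context approach: insert the whole corpus and let the model
    decide which parts it needs to solve the problem."""
    return "Context:\n" + "\n\n".join(corpus) + f"\n\nQuestion: {question}\nAnswer:"
```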
Current evaluation methods for LCLMs include the “needle-in-a-haystack” test and fixed-length datasets that haven’t been designed for long-context models.
“Critically, existing evaluations do not adequately stress-test LCLMs on any paradigm-shifting tasks,” the researchers write.
Long-Context Frontiers (LOFT)
Long-Context Frontiers (LOFT) is a suite of six tasks spanning 35 datasets across text, visual, and audio modalities. It is designed to gauge LCLMs on real-world tasks. LOFT currently supports 32k-, 128k-, and 1M-token context windows, and it also supports automatically generating versions with longer contexts as LCLMs continue to scale.
LOFT evaluates several important domains, including:
– Retrieval and RAG: Can the model reason over an entire knowledge corpus without the need to select specific parts for each prompt? Can it handle complex prompts that usually require multi-hop retrieval and adaptation to multiple examples?
– SQL: Can the model ingest an entire database and reason over the contents of its tables and other structured data, without a separate mechanism to convert the data into a form the model can consume? (A rough sketch of this setup follows the list.)
– In-context learning: Can the model learn difficult tasks from demonstrations included in the prompt? Does its performance keep improving as more demonstrations are added? Can you simply add more examples to the prompt instead of searching for an optimal set of few-shot examples?
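For the SQL-style tasks, the idea boils down to serializing structured data directly into the prompt. The sketch below is illustrative only; the table names, contents, and formatting are made up and do not reflect the benchmark's actual layout.

```python
# Illustrative only: serialize database tables as text so an LCLM can
# "ingest" the whole database inside its prompt. Table contents are made up.

def table_to_text(name: str, columns: list[str], rows: list[tuple]) -> str:
    """Render one table as a pipe-delimited block the model can read."""
    lines = [f"Table: {name}", " | ".join(columns)]
    lines += [" | ".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)

database = [
    table_to_text("employees", ["id", "name", "dept_id"],
                  [(1, "Ada", 10), (2, "Grace", 20)]),
    table_to_text("departments", ["dept_id", "dept_name"],
                  [(10, "Research"), (20, "Engineering")]),
]

prompt = (
    "You are given a database. Answer the question using only its contents.\n\n"
    + "\n\n".join(database)
    + "\n\nQuestion: Which department does Ada work in?\nAnswer:"
)
```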
Corpus-in-Context (CiC) Prompting
LOFT aims to open up a new line of research on long-context prompting, which DeepMind introduces as “Corpus-in-Context” (CiC) Prompting.
CiC combines several prompting strategies to activate the capabilities of LCLMs for learning, retrieving, and reasoning over in-context corpora. A CiC prompt is composed of several parts. The first part includes task-specific instructions, such as telling the model to read the corpus carefully and find relevant documents to answer the question.
Next, the entire knowledge corpus is inserted into the prompt, and each piece of information (e.g., passage, image) is assigned an ID. The model will use these IDs in retrieval and reasoning tasks.
The corpus is followed by few-shot learning examples that include chain-of-thought reasoning. All examples are grounded in the knowledge corpus to steer the model toward using the in-context knowledge.
Finally, the new query is added at the end of the prompt in the same format as the few-shot examples.
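Putting the pieces together, a CiC prompt might be assembled along these lines. The corpus passages, document IDs, and few-shot example below are invented for illustration; the paper's exact prompt templates and formatting may differ.

```python
# Illustrative sketch of assembling a Corpus-in-Context (CiC) prompt.

# 1. Task-specific instructions.
instructions = (
    "You will be given a corpus of documents, each with an ID. "
    "Read the corpus carefully, then answer the question by citing "
    "the IDs of the documents that support your answer."
)

# 2. The entire knowledge corpus, with an ID assigned to each passage.
corpus = {
    "DOC-001": "Marie Curie won the Nobel Prize in Physics in 1903.",
    "DOC-002": "Marie Curie won the Nobel Prize in Chemistry in 1911.",
    # ... in practice, the full corpus goes here, up to the context limit
}
corpus_block = "\n".join(f"[{doc_id}] {text}" for doc_id, text in corpus.items())

# 3. Few-shot examples with chain-of-thought reasoning, grounded in the
#    corpus through the same document IDs.
few_shot = (
    "Question: In which year did Marie Curie win her first Nobel Prize?\n"
    "Reasoning: [DOC-001] says she won the Physics prize in 1903, which "
    "precedes the 1911 Chemistry prize in [DOC-002].\n"
    "Answer: 1903 (supporting IDs: DOC-001)"
)

# 4. The new query, in the same format as the few-shot examples.
query = "Question: How many Nobel Prizes did Marie Curie win?"

cic_prompt = "\n\n".join([instructions, corpus_block, few_shot, query])
```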
One key advantage of CiC prompting is its compatibility with prefix caching in autoregressive language models, which means the corpus only needs to be encoded once. For each new request, the model only computes attention for the new query and reuses the cached activations of the corpus, few-shot examples, and instructions.
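The cost structure can be illustrated with a conceptual sketch. This is not a real inference API; the encoding step is a placeholder standing in for running the model over the long shared prefix and keeping its key/value activations.

```python
# Conceptual sketch of prefix caching with CiC prompts (not a real model API):
# the shared prefix (instructions + corpus + few-shot examples) is encoded
# once, and only each new query is processed from scratch afterwards.

from functools import lru_cache

@lru_cache(maxsize=1)
def encode_prefix(prefix: str) -> str:
    """Placeholder for the expensive step: encoding the long prefix and
    keeping its key/value activations for reuse."""
    print(f"Encoding prefix of {len(prefix)} characters...")  # runs only once
    return "<cached key/value activations>"

def answer(prefix: str, query: str) -> str:
    kv_cache = encode_prefix(prefix)  # reused from the cache after the first call
    # Only the query tokens require new attention computation at this point.
    return f"decode with {kv_cache} for: {query}"

# cic_prefix = instructions + corpus_block + few_shot  # from the sketch above
# answer(cic_prefix, "Question A")  # encodes the prefix once
# answer(cic_prefix, "Question B")  # reuses the cached prefix
```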
LOFT in Action
DeepMind evaluated Gemini 1.5 Pro (1M context), GPT-4o (128k context), and Claude 3 Opus (200k context) on LOFT, comparing them against fine-tuned models and model pipelines designed for the target task.
The results show that LCLMs rival the performance of many specialized models. At 128k tokens, LCLMs match Gecko, a leading textual retrieval system. In visual retrieval tasks, Gemini 1.5 Pro outperforms CLIP across all visual benchmarks and context lengths. In audio retrieval, Gemini 1.5 Pro is comparable to PaLM 2 DE across five languages. In SQL tasks, LCLMs achieve reasonable performance, though they remain significantly behind specialized pipelines.
The researchers also found that LCLMs lag significantly on complex multi-hop compositional reasoning tasks. And ablation studies revealed that the models’ performance varied depending on prompting strategies such as chain-of-thought reasoning. There is still much to learn about optimizing LCLMs for tasks with large in-context corpora.
“Our results on LOFT demonstrate that LCLMs can match the performance of many specialized models, while also revealing ample headroom for improvement in robust long-context reasoning as context windows continue to scale,” the researchers write. “We believe that LOFT provides a fertile testing ground for measuring progress in long-context modeling.”