This article is part of our coverage of the latest in AI research.
Large language models (LLMs) with very long context windows make it easier to create advanced AI applications with simple prompting techniques, without the need for complex tools and pipelines. However, evaluating the performance of long-context LLMs is still an underexplored area that needs more investigation.
In a new paper, Google DeepMind has introduced a benchmark called Long-Context Frontiers (LOFT) to rigorously evaluate the performance of long-context language models (LCLMs). LOFT is designed for tasks with very long prompts.
LOFT can be a great tool to evaluate and compare LLMs as context windows expand to millions of tokens.
Long-context language models
The limited context window of LLMs has traditionally required specialized techniques to customize the models for new tasks. For example, if a model cannot perform a task through few-shot learning, you need to fine-tune it. And if you want to add proprietary information to the prompt, you need a retrieval-augmented generation (RAG) pipeline to choose the specific bits of information from your corpus that are relevant to the task.
With long-context language models, you can just insert your entire corpus or training examples into the prompt and have the model learn the task or choose the parts it needs to solve the problem. You can further increase the capabilities of the model using techniques such as adding instructions and chain-of-thought reasoning.
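To make the contrast concrete, here is a minimal sketch, assuming a toy corpus of text passages. The helper functions and the scoring heuristic are invented for illustration and are not part of DeepMind's paper; they only show the difference between retrieving a small slice of the corpus and placing the entire corpus in the prompt.

```python
# Minimal sketch (not from the paper): RAG-style prompting vs. long-context prompting.

def score(passage: str, question: str) -> int:
    """Toy relevance score: count how many question words appear in the passage."""
    q_words = set(question.lower().split())
    return sum(1 for w in passage.lower().split() if w in q_words)

def rag_prompt(corpus: list[str], question: str, top_k: int = 5) -> str:
    """RAG pipeline: select only a small, question-specific slice of the corpus."""
    top = sorted(corpus, key=lambda p: score(p, question), reverse=True)[:top_k]
    return "Context:\n" + "\n\n".join(top) + f"\n\nQuestion: {question}\nAnswer:"

def long_context_prompt(corpus: list[str], question: str) -> str:
    """Long-context approach: insert the whole corpus and let the model
    decide which parts it needs to solve the problem."""
    return "Context:\n" + "\n\n".join(corpus) + f"\n\nQuestion: {question}\nAnswer:"
```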
Current evaluation methods for LCLMs include the “needle-in-a-haystack” test and fixed-length datasets that haven’t been designed for long-context models.
“Critically, existing evaluations do not adequately stress-test LCLMs on any paradigm-shifting tasks,” the researchers write.
Long-Context Frontiers (LOFT)
Long-Context Frontiers (LOFT) is a suite of six tasks spanning 35 datasets across text, visual, and audio modalities. It is designed to gauge LCLMs on real-world tasks. LOFT currently supports 32k-, 128k-, and 1M-token context windows, and it also supports automatically generating versions with longer contexts as LCLMs continue to scale.
LOFT evaluates several important domains, including:
– Retrieval and RAG: Can the model reason over an entire knowledge corpus without the need to select specific parts for each prompt? Can it handle complex prompts that usually require multi-hop retrieval and adaptation to multiple examples?
– SQL: Can the model ingest an entire database and reason over the contents of its tables and other structured data, without a separate mechanism to convert the data into a form the model can consume? (A rough sketch of this setup follows the list.)
– In-context learning: Can the model learn difficult tasks from demonstrations included in the prompt? Does its performance keep improving as more demonstrations are added? Can you simply add more examples to the prompt instead of searching for an optimal set of few-shot examples?
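For the SQL-style tasks, the idea boils down to serializing structured data directly into the prompt. The sketch below is illustrative only; the table names, contents, and formatting are made up and do not reflect the benchmark's actual layout.

```python
# Illustrative only: serialize database tables as text so an LCLM can
# "ingest" the whole database inside its prompt. Table contents are made up.

def table_to_text(name: str, columns: list[str], rows: list[tuple]) -> str:
    """Render one table as a pipe-delimited block the model can read."""
    lines = [f"Table: {name}", " | ".join(columns)]
    lines += [" | ".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)

database = [
    table_to_text("employees", ["id", "name", "dept_id"],
                  [(1, "Ada", 10), (2, "Grace", 20)]),
    table_to_text("departments", ["dept_id", "dept_name"],
                  [(10, "Research"), (20, "Engineering")]),
]

prompt = (
    "You are given a database. Answer the question using only its contents.\n\n"
    + "\n\n".join(database)
    + "\n\nQuestion: Which department does Ada work in?\nAnswer:"
)
```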
Corpus-in-Context (CiC) Prompting
LOFT aims to open up a new line of research on long-context prompting, which DeepMind introduces as “Corpus-in-Context” (CiC) Prompting.
CiC combines several prompting strategies to activate the capabilities of LCLMs for learning, retrieving, and reasoning over in-context corpora. A CiC prompt is composed of several parts. The first part includes task-specific instructions, such as telling the model to read the corpus carefully and find relevant documents to answer the question.
Next, the entire knowledge corpus is inserted into the prompt, and each piece of information (e.g., passage, image) is assigned an ID. The model will use these IDs in retrieval and reasoning tasks.
The corpus is followed by few-shot learning examples that include chain-of-thought reasoning. All examples are grounded in the knowledge corpus to steer the model toward using the in-context knowledge.
Finally, the new query is added at the end of the prompt in the same format as the few-shot examples.
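Putting the pieces together, a CiC prompt might be assembled along these lines. The corpus passages, document IDs, and few-shot example below are invented for illustration; the paper's exact prompt templates and formatting may differ.

```python
# Illustrative sketch of assembling a Corpus-in-Context (CiC) prompt.

# 1. Task-specific instructions.
instructions = (
    "You will be given a corpus of documents, each with an ID. "
    "Read the corpus carefully, then answer the question by citing "
    "the IDs of the documents that support your answer."
)

# 2. The entire knowledge corpus, with an ID assigned to each passage.
corpus = {
    "DOC-001": "Marie Curie won the Nobel Prize in Physics in 1903.",
    "DOC-002": "Marie Curie won the Nobel Prize in Chemistry in 1911.",
    # ... in practice, the full corpus goes here, up to the context limit
}
corpus_block = "\n".join(f"[{doc_id}] {text}" for doc_id, text in corpus.items())

# 3. Few-shot examples with chain-of-thought reasoning, grounded in the
#    corpus through the same document IDs.
few_shot = (
    "Question: In which year did Marie Curie win her first Nobel Prize?\n"
    "Reasoning: [DOC-001] says she won the Physics prize in 1903, which "
    "precedes the 1911 Chemistry prize in [DOC-002].\n"
    "Answer: 1903 (supporting IDs: DOC-001)"
)

# 4. The new query, in the same format as the few-shot examples.
query = "Question: How many Nobel Prizes did Marie Curie win?"

cic_prompt = "\n\n".join([instructions, corpus_block, few_shot, query])
```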
One key advantage of CiC prompting is its compatibility with prefix caching in autoregressive language models, which means the corpus only needs to be encoded once. For each new request, the model only computes attention for the new query and reuses the cached activations of the corpus, few-shot examples, and instructions.
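The cost structure can be illustrated with a conceptual sketch. This is not a real inference API; the encoding step is a placeholder standing in for running the model over the long shared prefix and keeping its key/value activations.

```python
# Conceptual sketch of prefix caching with CiC prompts (not a real model API):
# the shared prefix (instructions + corpus + few-shot examples) is encoded
# once, and only each new query is processed from scratch afterwards.

from functools import lru_cache

@lru_cache(maxsize=1)
def encode_prefix(prefix: str) -> str:
    """Placeholder for the expensive step: encoding the long prefix and
    keeping its key/value activations for reuse."""
    print(f"Encoding prefix of {len(prefix)} characters...")  # runs only once
    return "<cached key/value activations>"

def answer(prefix: str, query: str) -> str:
    kv_cache = encode_prefix(prefix)  # reused from the cache after the first call
    # Only the query tokens require new attention computation at this point.
    return f"decode with {kv_cache} for: {query}"

# cic_prefix = instructions + corpus_block + few_shot  # from the sketch above
# answer(cic_prefix, "Question A")  # encodes the prefix once
# answer(cic_prefix, "Question B")  # reuses the cached prefix
```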
LOFT in Action
DeepMind evaluated Gemini 1.5 Pro (1M context), GPT-4o (128k context), and Claude 3 Opus (200k context) on LOFT, comparing them against fine-tuned models and model pipelines designed for the target task.
The results show that LCLMs rival the performance of many specialized models. At 128k tokens, LCLMs match Gecko, a leading textual retrieval system. In visual retrieval tasks, Gemini 1.5 Pro outperforms CLIP across all visual benchmarks and context lengths. In audio retrieval, Gemini 1.5 Pro is comparable to PaLM 2 DE across five languages. In SQL tasks, LCLMs achieve reasonable performance, though they remain significantly behind specialized pipelines.
The researchers also found that LCLMs lag significantly on complex multi-hop compositional reasoning tasks. And ablation studies revealed that the models’ performance varied depending on prompting strategies such as chain-of-thought reasoning. There is still much to learn about optimizing LCLMs for tasks with large in-context corpora.
“Our results on LOFT demonstrate that LCLMs can match the performance of many specialized models, while also revealing ample headroom for improvement in robust long-context reasoning as context windows continue to scale,” the researchers write. “We believe that LOFT provides a fertile testing ground for measuring progress in long-context modeling.”