Promptriever trains LLMs for information retrieval and instruction following

llama reading documents
Image created with DALL-E 3

This article is part of our coverage of the latest in AI research.

Information retrieval is an important part of many applications built on top of large language models (LLMs), especially when they use bespoke knowledge that is not included in the model’s training data. The industry is therefore actively pursuing better information retrieval (IR) systems that can improve the performance of LLM applications.

Promptriever, a new technique introduced by researchers at Johns Hopkins University and Samaya AI, claims to be the “first retrieval model able to be prompted like a language model” and “the first zero-shot promptable retriever.” In addition to retrieving relevant information, Promptriever can also follow instructions included in the prompt.

The problem with current IR systems

To understand the benefits of Promptriever, let’s first consider how classic IR systems work. In LLM applications that use external information sources, the user’s query is parsed and matched against a datastore to retrieve relevant chunks of data, which are added to the prompt as context.

Current IR systems usually match users’ queries with documents based on a single semantic similarity score. For example, a bi-encoder model computes embeddings for the user’s query and the target document, then takes the dot product of the two vectors to measure their similarity. There are several techniques that adapt pre-trained LLMs to retrieval tasks. However, when a model undergoes IR training, it loses other capabilities it may have previously had, such as instruction following. There have been some efforts to preserve the instruction-following capabilities of IR-trained LLMs, but those are mostly limited to instruction templates as opposed to open-ended natural language commands.
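To make this concrete, here is a minimal sketch of bi-encoder scoring. The encode function below is a hypothetical stand-in for a trained embedding model (its random vectors produce meaningless scores), but the dot-product ranking logic is the same one real bi-encoders use:

```python
import numpy as np

# Hypothetical stand-in for a trained bi-encoder: maps text to a
# unit-length vector. A real system would use a neural encoder
# (e.g., an LLM-based embedding model), not this random stub.
def encode(text: str, dim: int = 8) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)

query = "What are the side effects of aspirin?"
docs = [
    "Aspirin can cause stomach upset and, rarely, bleeding.",
    "Aspirin was first synthesized at Bayer in 1897.",
]

q_vec = encode(query)
# Score each document by the dot product of the two embeddings,
# then rank documents from most to least similar.
scores = [float(q_vec @ encode(d)) for d in docs]
for doc, score in sorted(zip(docs, scores), key=lambda x: -x[1]):
    print(f"{score:+.3f}  {doc}")
```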

Information retrieval models
Information retrieval models (source: arXiv)

What is Promptriever?

Promptriever introduces a training paradigm that preserves the model’s ability to follow instance-specific instructions. The model the researchers introduce in their paper is a bi-encoder based on Llama-2 7B, but the training recipe can be applied to other LLMs.

Like other IR recipes, Promptriever starts with a pre-trained model, such as an instruction-tuned version of Llama 2. However, the researchers use richer training examples when fine-tuning the model for information retrieval. The classic IR training paradigm provides the model with pairs of user queries and matching document chunks. In Promptriever’s training examples, the user query is augmented with instructions that specify the kind of information the retrieved document should contain. These instructions are not generic; they are specific to that instance.
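As a rough illustration, here is what such an augmented training example might look like. The field names and texts are hypothetical; the actual schema of the dataset may differ:

```python
# A hypothetical Promptriever-style training example. The structure
# is illustrative, not the exact format used in the paper.
example = {
    "query": "what causes aurora borealis",
    # Instance-specific instruction: it narrows relevance for this
    # particular query rather than serving as a generic template.
    "instruction": (
        "A relevant passage must explain the physical mechanism "
        "behind auroras, not where or when to view them."
    ),
    "positive_passage": (
        "Auroras occur when charged particles from the sun are guided "
        "by Earth's magnetic field into the upper atmosphere, where "
        "they collide with gases and emit light."
    ),
}

# The retriever encodes the query and instruction together as one input.
model_input = f"{example['query']} {example['instruction']}"
print(model_input)
```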

Moreover, in addition to the matching documents, the training instances include “instruction negative” examples: chunks of information that are related to the topic of the query but not relevant to the question the user is asking. This is a common problem in retrieval-augmented generation (RAG) applications, where retrieved documents are often rich in content but do not answer the user’s question. Including instruction negatives in the training set helps condition the model on both the original query and the instructions.
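The sketch below shows where instruction negatives might fit in a standard InfoNCE-style contrastive loss, using random vectors in place of real embeddings. The paper’s exact training objective may differ; the point is that the instruction negative appears alongside ordinary negatives, so the model learns to score it below the true positive:

```python
import numpy as np

def info_nce(q, pos, negs, temperature=0.05):
    """Contrastive loss: pull the query embedding toward the positive
    passage and push it away from the negatives, including instruction
    negatives that are on-topic but violate the instruction."""
    sims = np.array([q @ pos] + [q @ n for n in negs]) / temperature
    sims -= sims.max()  # numerical stability before exponentiation
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])

rng = np.random.default_rng(0)
q = rng.standard_normal(8)                   # query + instruction embedding
pos = q + 0.1 * rng.standard_normal(8)       # satisfies query and instruction
inst_neg = q + 0.5 * rng.standard_normal(8)  # on-topic, violates instruction
rand_neg = rng.standard_normal(8)            # unrelated passage

print(f"loss: {info_nce(q, pos, [inst_neg, rand_neg]):.3f}")
```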

Promptriever
Promptriever (source: arXiv)

Creating the dataset

To train the model, the researchers created a dataset of about 500,000 examples, a challenging task that they accomplished through an interesting approach. They started with a subset of the popular MS MARCO search and question-answering dataset and used Llama-3-70B-Instruct to augment the queries with extra instructions. They then used GPT-4o to generate several instruction negative examples for each instance and vetted the examples with FollowIR-7B, an information retrieval model trained to follow complex instructions. Examples that failed the vetting were discarded. The researchers have released the dataset on Hugging Face for other researchers to use.
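The sketch below outlines this pipeline in Python. The generate function is a placeholder for calls to the named models through whatever serving stack is available, and the prompts and filtering criterion are illustrative assumptions, not the ones used in the paper:

```python
# Hedged sketch of the dataset-creation pipeline described above.
def generate(model: str, prompt: str) -> str:
    # Placeholder: wire this up to an LLM serving stack of your choice.
    raise NotImplementedError

def build_example(query: str, positive_passage: str) -> dict | None:
    # 1. Augment the MS MARCO query with an instance-specific instruction.
    instruction = generate(
        "Llama-3-70B-Instruct",
        "Write an instruction describing what a relevant passage "
        f"for this query must contain: {query}",
    )
    # 2. Generate an on-topic passage that violates the instruction.
    negative = generate(
        "GPT-4o",
        f"Write a passage related to '{query}' that does NOT satisfy "
        f"this instruction: {instruction}",
    )
    # 3. Vet the negative with FollowIR-7B; discard examples that fail.
    verdict = generate(
        "FollowIR-7B",
        f"Given the query '{query}' and instruction '{instruction}', "
        f"is this passage irrelevant? Answer yes or no: {negative}",
    )
    if not verdict.strip().lower().startswith("yes"):
        return None  # filtered out of the dataset
    return {
        "query": query,
        "instruction": instruction,
        "positive": positive_passage,
        "instruction_negative": negative,
    }
```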

According to the researchers’ findings, Promptriever outperforms other bi-encoder models and scores close to the best cross-encoder models on instruction-following tasks. It achieves this level of performance without sacrificing standard retrieval quality. Another interesting characteristic of Promptriever is its ability to respond to advanced prompting techniques, which makes it possible to further tune the model’s performance at inference time.
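For instance, steering the retriever at inference time can be as simple as appending a free-form instruction to the query. This is a hypothetical usage sketch; the exact prompting format depends on how the model was trained:

```python
# Hypothetical inference-time prompting: both strings go through the
# same encoder, so only the input text changes, not the model.
base_query = "python garbage collection"
prompted_query = (
    base_query
    + " A relevant document explains CPython's reference counting and "
      "generational collector; tutorials on the gc module are not relevant."
)
print(prompted_query)
```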

“Our results demonstrate that with the right training data, modern bi-encoders can be instructed/prompted in free-form natural language, in a similar manner to LMs. We hope this alignment between the LM and IR communities will allow for further improvements to dense retrieval models,” the researchers write.
