This article is part of our coverage of the latest in AI research.
Embeddings are crucial to many deep learning applications, especially those that use large language models (LLMs). The quality of the embeddings directly affects the performance of the models in different applications.
Ideally, you should be able to create custom embedding models for your applications. However, training embedding models comes with many challenges. This is why developers usually use embedding models pre-trained for general applications.
A new paper by researchers at Microsoft proposes a technique that significantly reduces the cost and complexity of training custom embedding models. The technique uses open-source LLMs instead of BERT-like encoders, which cuts down the number of training steps required. It also uses proprietary LLMs to automatically generate labeled training data. This kind of research can unlock new LLM applications and enable organizations to create custom embedding models for their needs.
The challenges of training embedding models
Embedding models create numerical representations that capture the main features of the input data. For example, word embeddings capture the semantic meaning of words, sentence embeddings capture the relationships between words in a sentence, and image embeddings represent the visual features of their inputs. Embeddings are useful for various tasks, such as measuring the similarity of two words, sentences, or texts.
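As a rough illustration of that last point, here is a minimal sketch of comparing sentences with a general-purpose embedding model. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint, which are just examples and are not part of the paper:

# Minimal sketch: measuring sentence similarity with a pre-trained embedding model.
# Assumes the sentence-transformers library; the model name is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline was resting on the rug.",
    "Quarterly revenue exceeded expectations.",
]

# Each sentence becomes a fixed-size vector that captures its meaning.
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity: semantically close sentences score higher.
print(util.cos_sim(embeddings[0], embeddings[1]))  # related sentences
print(util.cos_sim(embeddings[0], embeddings[2]))  # unrelated sentences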
One of the important applications of embeddings is retrieval-augmented generation (RAG) with LLMs. In RAG, embeddings help find and retrieve documents that are relevant to a user’s prompt. The content of the retrieved documents is inserted into the prompt, and the LLM is instructed to generate its response based on those documents. RAG enables LLMs to avoid hallucinations and accomplish tasks involving information beyond their training data.
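The retrieval step boils down to a nearest-neighbor search in embedding space. Here is a minimal sketch of that step, with a hypothetical embed() function standing in for whatever embedding model the pipeline uses:

# Minimal sketch of the retrieval step in a RAG pipeline.
# embed() stands in for any embedding model; the function name is hypothetical.
import numpy as np

def retrieve(query, documents, embed, top_k=3):
    """Return the documents whose embeddings are closest to the query embedding."""
    query_vec = embed(query)
    doc_vecs = np.array([embed(doc) for doc in documents])
    # Cosine similarity between the query and every document.
    scores = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in top]

# The retrieved documents are then inserted into the prompt, for example:
# prompt = f"Answer using these documents:\n{retrieved}\n\nQuestion: {query}"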
The quality of RAG is highly dependent on the quality of the embedding model. If the embeddings don’t capture the right features from the documents and match them to the user prompts, then the RAG pipeline will not be able to retrieve relevant documents.
Training embedding models on custom data is one of the methods to improve their quality for specific applications. But the method used in today’s popular embedding models is a multi-stage training process. First, the model is trained on a large-scale dataset of weakly supervised text pairs through contrastive learning. Then the model is fine-tuned on a small but high-quality dataset of carefully labeled examples.
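The contrastive stage typically uses an InfoNCE-style objective that pulls paired texts together and pushes unrelated texts apart. The following is a minimal sketch of such a loss with in-batch negatives, written for illustration rather than taken from any of the models mentioned here:

# Sketch of an InfoNCE-style contrastive loss with in-batch negatives,
# the kind of objective used in the first stage of traditional embedding training.
import torch
import torch.nn.functional as F

def contrastive_loss(query_embs, doc_embs, temperature=0.05):
    # Normalize so the dot product equals cosine similarity.
    q = F.normalize(query_embs, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    # Similarity of every query to every document in the batch.
    logits = q @ d.T / temperature
    # The matching document for query i sits at position i (the diagonal);
    # all other documents in the batch act as negatives.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)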
The problem with this approach is that it requires substantial engineering effort to curate relevant text pairs. It also relies on manually collected datasets that often cover only a few tasks and languages. This is why, for the most part, developers use general embedding models that might not be suitable for their applications.
LLMs as embedding models
The new technique that Microsoft proposes trains embedding models in a single stage, as opposed to the two-stage approach used in other models. For this, the researchers rely on proprietary LLMs like GPT-4 to generate synthetic data for a diverse range of embedding tasks.
First, they prompt GPT-4 to create a small pool of candidate tasks. Next, they feed the candidate tasks to the model and prompt it to generate training examples.
I tested a simplified version of the prompt pipeline on ForeFront, a platform that enables you to create no-code workflows on top of GPT-4 and Claude. In the first stage, I received a list of abstract task descriptions:
[
"Retrieve case law that supports or opposes a specific legal argument.",
"Find recipes based on a list of ingredients provided in the query.",
"Search for news articles related to a particular event or phenomenon.",
"Locate scholarly articles that discuss the efficacy of a certain medical treatment.",
"Identify historical documents that mention a specific figure or event.",
"Gather financial reports related to a particular industry or company.",
… (redacted)
]
In the next stage, I provided one of the examples to GPT-4 and asked it to generate a JSON structure that includes a concrete user prompt along with 150-word positive and negative examples. The results were almost perfect, except for the last sentence in the hard negative, which can mess up the embeddings:
{
"user_query": "Looking for a vegetarian chili recipe that uses black beans, quinoa, and sweet potatoes",
"positive_document": "This hearty vegetarian chili is the perfect way to warm up on a chilly day. Begin by sautéing one diced onion and three cloves of minced garlic in olive oil until translucent. Add one diced bell pepper and two diced sweet potatoes and cook until slightly soft. Stir in two cups of vegetable broth, one can of black beans (rinsed and drained), one cup of cooked quinoa, one can of diced tomatoes, and one can of tomato paste. Season with two tablespoons of chili powder, one teaspoon of cumin, and salt and pepper to taste. Let the chili simmer for 30 minutes, stirring occasionally. Serve hot, garnished with avocado, cilantro, and a squeeze of lime juice.",
"hard_negative_document": "This delicious sweet potato and black bean burrito bowl is a great vegetarian meal. Start by roasting two diced sweet potatoes in the oven with olive oil, salt, and pepper until golden and tender. Meanwhile, cook one cup of quinoa as per package instructions. In a pan, cook one can of black beans with one diced onion, two cloves of garlic, and one tablespoon of cumin until heated through. Assemble your bowls by placing a scoop of quinoa at the bottom, followed by the roasted sweet potatoes, and the black bean mixture. Top with sliced avocado, fresh cilantro, and a dollop of sour cream. While this recipe shares ingredients with the user's query, it is not a chili recipe."
}
The researchers have not released any source code or data for their experiments. But you can see a very simplified version of the pipeline in this Python notebook that I created. Naturally, this is a very flexible process and you can easily customize the templates based on your needs.
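For reference, the two prompting stages can be strung together in a few lines of Python. The sketch below assumes the official openai client and an API key in the environment; the prompts are only a loose approximation of the paper’s templates, not the actual ones:

# Loose sketch of the two-stage synthetic data pipeline, not the paper's actual prompts.
# Assumes the official openai Python client and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

def ask(prompt):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Stage 1: brainstorm a pool of candidate retrieval tasks.
tasks = ask("Brainstorm a list of 10 text retrieval tasks. Return a JSON list of strings.")

# Stage 2: for each task, generate a (query, positive, hard negative) training example.
for task in json.loads(tasks):
    example = ask(
        f"Task: {task}\n"
        "Generate a JSON object with keys user_query, positive_document, "
        "and hard_negative_document, each document around 150 words."
    )
    print(example)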
To increase the diversity of the dataset, the researchers designed several prompt templates and combined them. Overall, they generated 500,000 examples with 150,000 unique instructions with GPT-3.5 and GPT-4 through Azure OpenAI Service. Their total token consumption was about 180 million, which would cost somewhere around $5,000.
Interestingly, the researchers used their training data to fine-tune an open-source autoregressive model instead of a bidirectional encoder like BERT, which is the norm. The premise is that since these models have been pre-trained on very large datasets, they can be fine-tuned for embedding tasks at very low costs.
They tested their method by fine-tuning Mistral-7B on the synthetic data and 13 public datasets. Using techniques such as LoRA, they reduced training costs. The fine-tuned model obtained state-of-the-art results on popular benchmark datasets and even outperformed OpenAI’s Ada-002 and Cohere’s embedding model on RAG and embedding quality benchmarks.
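To give a sense of what LoRA fine-tuning looks like in practice, here is a sketch of attaching low-rank adapters to Mistral-7B with the Hugging Face peft library. The hyperparameters are illustrative and not taken from the paper:

# Sketch of attaching LoRA adapters to Mistral-7B for embedding fine-tuning.
# Assumes the transformers and peft libraries; hyperparameters are illustrative.
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attach adapters to attention projections
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trained

Because only the adapter weights are updated, the memory and compute requirements are a fraction of what full fine-tuning of a 7-billion-parameter model would demand.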
LLMs and embeddings
The main finding of the paper is that when training auto-regressive models such as Mistral-7B for embedding tasks, there is no need to undergo the expensive contrastive pre-training phase.
“Extensive auto-regressive pre-training enables LLMs to acquire good text representations, and only minimal fine-tuning is required to transform them into effective embedding models,” they write.
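In such a setup, the text embedding is read off the decoder’s hidden states. A common choice for decoder-only models is to take the hidden state of the final token, as in the hedged sketch below; the model name is illustrative and not the paper’s released checkpoint:

# Sketch of extracting a text embedding from a decoder-only model via last-token pooling.
# Model name and pooling details are illustrative, not the paper's released checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    # Last-token pooling: use the hidden state of the final token as the embedding.
    return hidden_states[0, -1]

vector = embed("Instruct: retrieve relevant recipes\nQuery: vegetarian chili with quinoa")
print(vector.shape)  # the model's hidden size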
Their findings also suggest that LLMs should be able to generate suitable training data to fine-tune embedding models at very low cost. This can have an important impact on future LLM applications, enabling organizations to create custom embeddings for their applications.
“We posit that generative language modeling and text embeddings are the two sides of the same coin, with both tasks requiring the model to have a deep understanding of the natural language,” the researchers write. “Given an embedding task definition, a truly robust LLM should be able to generate training data on its own and then be transformed into an embedding model through light-weight fine-tuning. Our experiments shed light on the potential of this direction, and more research is needed to fully explore it.”