In recent months, frontier artificial intelligence companies and research labs have made impressive progress in extending the context windows of large language models (LLMs). The context window is the length of the input that the model can process in a single pass. The longer the context window, the more information and instructions you can fit into the prompt given to the model.
In a few years, context windows have gone from 2,048 tokens in GPT-3 to one million tokens in Gemini 1.5 Pro. New techniques promise to further extend the memory of LLMs to effectively infinite tokens. And improved attention techniques enable LLMs to pick out very specific pieces of information buried in very long stretches of text, a capability measured by the “needle in a haystack” test.
With LLMs supporting longer contexts, one question that is often raised is whether we will still need to fine-tune LLMs or use retrieval-augmented generation (RAG). Both techniques, while very effective and useful, can require extensive engineering effort.
As with many other things regarding LLMs, the answer is both “yes” and “no.” LLMs can obviate the need for many engineering efforts in the early stages of projects. But when scaling the usage of models, you will need to revisit tried and tested optimization techniques.
Infinite context vs fine-tuning
Fine-tuning LLMs requires several stages. You start by collecting and labeling your training data. Then you need to choose a model that suits your needs, set up a compute cluster, and write and run the code for fine-tuning. With the advent of fine-tuning services, you can now fine-tune a model through an API without setting up your own GPUs. However, you still need to manage the training process, including hyperparameters such as the number of epochs, and evaluate the fine-tuned model.
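As a rough illustration, here is what a managed fine-tuning job might look like with the OpenAI Python SDK; the training file name, base model, and epoch count are assumptions for the sketch, not recommendations:

```python
# Minimal sketch of a managed fine-tuning job, assuming the OpenAI Python SDK.
# The training file, base model name, and epoch count are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of labeled input-output examples.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the fine-tuning job; you still choose hyperparameters such as epochs.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumed base model
    hyperparameters={"n_epochs": 3},
)

print(job.id, job.status)
```

Even with the GPUs abstracted away, you are still responsible for the quality of `train.jsonl` and for checking whether the resulting model actually performs better than the base model.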
In contrast, with infinite-context LLMs, you can adjust the model’s behavior through prompt engineering. A recent paper by Google DeepMind explores many-shot in-context learning (ICL), which the growing context windows of LLMs make possible. Basically, by inserting hundreds or thousands of input-output examples into the prompt, you can get the model to do things that previously required fine-tuning.
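In its simplest form, many-shot ICL is just packing a large number of labeled examples into one long prompt. The sketch below uses a hypothetical support-ticket routing task to show the idea:

```python
# Minimal sketch of many-shot in-context learning: pack hundreds of labeled
# examples into one long prompt instead of fine-tuning. The task (support
# ticket routing) and the example data are hypothetical placeholders.
examples = [
    {"input": "My card was charged twice for one order.", "label": "billing"},
    {"input": "The app crashes when I open settings.", "label": "bug"},
    # ... hundreds or thousands more examples, as the context window allows
]

def build_many_shot_prompt(examples, query):
    shots = "\n".join(
        f"Input: {ex['input']}\nLabel: {ex['label']}" for ex in examples
    )
    return (
        "Classify each support ticket into a category.\n\n"
        f"{shots}\n\n"
        f"Input: {query}\nLabel:"
    )

prompt = build_many_shot_prompt(examples, "I never received my refund.")
# `prompt` can now be sent to any long-context chat model.
print(prompt)
```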
The technical barrier to entry for prompt engineering is very low, making it accessible to anyone who has access to a model. Even people without software development experience can use techniques such as many-shot ICL to configure an LLM for their needs.
Infinite context vs RAG
Retrieval-augmented generation (RAG) is even more technical than fine-tuning. First, you need to break your document(s) into manageable chunks, compute their embeddings, and store them in a vector database. Then you need to create a prompt pipeline that calculates the embedding of the user’s request, retrieves the relevant document chunks from the vector store, and adds their content to the prompt before passing it to the model.
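A bare-bones version of that pipeline looks roughly like the sketch below. The toy bag-of-words `embed` function stands in for a real embedding model, and the in-memory list stands in for a vector database:

```python
# Bare-bones RAG sketch. The bag-of-words embed() and the in-memory "store"
# are stand-ins for a real embedding model and a vector database.
import math
from collections import Counter

def embed(text):
    # Toy embedding: word counts. Replace with a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1) Chunk documents and store their embeddings.
chunks = [
    "The context window is the length of input a model can process.",
    "RAG retrieves relevant chunks and adds them to the prompt.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

# 2) At query time, embed the request and retrieve the closest chunks.
query = "What does RAG do?"
query_vec = embed(query)
top = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)[:1]

# 3) Build the final prompt from the retrieved context.
context = "\n".join(chunk for chunk, _ in top)
prompt = f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)
```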
To improve a RAG pipeline, you must turn to more advanced techniques such as reranking, multi-hop retrieval, and custom embedding models.
In contrast, with infinite attention, you can simply dump all your documents into the prompt and try out different instructions that enable the model to pick the relevant parts and use them for its response. Frontier models now allow you to load several books’ worth of data into the prompt. And they’re very good at pinpointing specific information for their answers.
This means that, for example, you can insert the entire documentation for a programming library into your prompt and get the model to help you write code with the library.
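The long-context version of that workflow is little more than concatenation. In the sketch below, the file paths and the `call_llm()` placeholder are hypothetical; the point is that there is no chunking, embedding, or retrieval step at all:

```python
# Minimal sketch of the long-context approach: concatenate whole documents
# into one prompt. The file paths and call_llm() placeholder are hypothetical.
from pathlib import Path

doc_paths = [Path("docs/getting_started.md"), Path("docs/api_reference.md")]  # assumed paths
documents = "\n\n".join(p.read_text() for p in doc_paths if p.exists())

prompt = (
    "Below is the full documentation of a programming library.\n\n"
    f"{documents}\n\n"
    "Using only this documentation, write a short example script that uses the library."
)

# call_llm() stands in for whichever long-context model API you use.
# response = call_llm(prompt)
```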
LLMs and engineering
The general trend with LLMs is a lower barrier to entry for creating machine learning systems. Thanks to the zero-shot, few-shot, and now many-shot learning capabilities of LLMs, you can get them to do tasks that previously required days or weeks of engineering. For example, you can create a full sentiment analysis system with an LLM such as GPT-4 or Claude 3 without training any models and with minimal coding.
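As a rough sketch, such a zero-shot sentiment classifier can be a single prompt to a chat model; the snippet below assumes the OpenAI Python SDK, and the model name is an assumption rather than a requirement:

```python
# Zero-shot sentiment analysis with a chat model, assuming the OpenAI Python SDK.
# The model name is an assumption; any capable chat LLM would work.
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text):
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[
            {
                "role": "system",
                "content": "Classify the sentiment of the user's text as "
                           "positive, negative, or neutral. Reply with one word.",
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(classify_sentiment("The battery life is fantastic, but the screen scratches easily."))
```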
Longer contexts will continue this trend and obviate the need for engineering effort on many complicated tasks. However, long- and infinite-context LLMs are not a silver bullet.
Creating successful products and applications does not rely only on a proof of concept that solves a problem. It also requires building a system that works at scale.
For example, when you’re working with tens or hundreds of inference requests during prototyping, cost and inference speed will not be much of a problem. But when you are processing tens of millions of requests per day, adding or removing a few tokens from each prompt can have a considerable impact on compute, memory, and financial costs.
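Back-of-the-envelope arithmetic makes the point; the per-token price below is an assumed figure for illustration, not a quote from any provider:

```python
# Back-of-the-envelope cost of a few extra prompt tokens at scale.
# The price per input token is an assumed figure for illustration only.
requests_per_day = 50_000_000            # tens of millions of requests
extra_tokens_per_request = 100           # a few extra instructions or examples
price_per_input_token = 5 / 1_000_000    # assumed: $5 per million input tokens

extra_cost_per_day = requests_per_day * extra_tokens_per_request * price_per_input_token
print(f"Extra cost per day: ${extra_cost_per_day:,.0f}")    # $25,000
print(f"Extra cost per year: ${extra_cost_per_day * 365:,.0f}")
```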
Fine-tuning, RAG, and all the techniques and tools that have been created to support them serve these purposes. For example, low-rank adaptation (LoRA) enables you to create hundreds or thousands of fine-tuned LLMs without the need to store billions of parameters for each model. These techniques can be game-changers for high-usage applications.
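To see why LoRA is so storage-friendly, note that only the small low-rank adapter matrices are trained and saved per variant, not the full model. The sketch below uses the Hugging Face peft library; the base model name, target modules, and hyperparameters are illustrative assumptions:

```python
# Minimal LoRA sketch with Hugging Face peft: only the small adapter weights
# are trained and saved per variant, not the billions of base parameters.
# The base model name, target modules, and hyperparameters are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base model

config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# After training, only the adapter is saved; each variant is a few megabytes.
model.save_pretrained("adapters/customer-support-v1")  # hypothetical path
```

Because each adapter is tiny, serving hundreds of customer-specific variants on top of one shared base model becomes practical.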
As frontier labs continue to push the capabilities of LLMs, they will simplify the creation of AI application concepts. Product teams will be able to create and iterate prototypes without the need for an ML team. This will accelerate the process of reaching product-market fit. But as you move beyond proof-of-concept, you can’t underestimate the value of good engineering skills and a talented team that can create a reliable and scalable ML pipeline.
As HyperWrite AI CEO Matt Shumer eloquently put it, “Prompt your way to PMF. Then fine-tune to scale.”