
The context window problem, or why an LLM forgets the middle of a long file


In a recent interview, former Google CEO Eric Schmidt said that the context window of large language models (LLMs) can be used as short-term memory. But there is a problem: if you load a fairly long text (for example, several books) into the context window, the AI will forget the middle.

According to Schmidt, this is what makes AI similar to people, because the human brain, in his view, behaves the same way.

Here is why this statement is doubly wrong.

What is a context window?

The context window is the volume of text that an LLM can see and refer to while generating new text. It is important to understand that this is not the entire body of data the model was trained on, but only the relatively small slice of information that is directly used to generate a response.
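To make the definition concrete, here is a minimal sketch of how one might check whether a document fits into a context window. It assumes the open-source tiktoken tokenizer; the 8,192-token limit is an arbitrary illustrative number, not the limit of any particular model.

```python
# Minimal sketch: checking whether a document fits in a context window.
# Assumes the open-source `tiktoken` tokenizer; the 8,192-token limit is an
# illustrative number, not the limit of any particular model.
import tiktoken

CONTEXT_LIMIT = 8192  # hypothetical context window size, in tokens

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(text: str, limit: int = CONTEXT_LIMIT) -> bool:
    """Return True if the tokenized text fits inside the context window."""
    n_tokens = len(enc.encode(text))
    print(f"{n_tokens} tokens (limit {limit})")
    return n_tokens <= limit

fits_in_context("The quick brown fox jumps over the lazy dog.")
```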

What is the cause of the problem?

From a mathematical point of view, a neural network is not a search engine but a statistical analyzer of data. It predicts semantically related words and meanings based on the data it was trained on, and its working tools are weight parameters, which are essentially just vectors of numbers.
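Here is a toy sketch of that statistical view, with every number invented for illustration: the model's weight matrix scores each candidate next word against the current context, and the next word is sampled from the resulting probability distribution.

```python
# Toy sketch of the statistical view: the model's weights score every word in a
# tiny vocabulary against the current context, and the next word is sampled from
# the resulting probability distribution. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "sat", "mat", "ran"]

context_vector = rng.normal(size=16)   # hidden state summarizing the prompt so far
W = rng.normal(size=(len(vocab), 16))  # weight matrix: one row of parameters per word

logits = W @ context_vector                    # raw score for each candidate next word
probs = np.exp(logits) / np.exp(logits).sum()  # softmax: scores -> probabilities

next_word = rng.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_word)
```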

This is why any text enters an LLM as a set of numbers (a process called embedding). After the text is split into fragments, each fragment, no matter how many characters it actually contains, enters the model as a numeric sequence of the same fixed length.
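A minimal sketch of that fixed-length property, assuming the open-source sentence-transformers package and its all-MiniLM-L6-v2 model (384 dimensions); other embedding models behave the same way, just with a different vector size.

```python
# Sketch of the fixed-length property: texts of very different lengths map to
# vectors of exactly the same dimensionality. Assumes the open-source
# `sentence-transformers` package and its `all-MiniLM-L6-v2` model (384 dimensions).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

short_text = "Hello."
long_text = "A much longer passage, spanning many repeated sentences. " * 20

vectors = model.encode([short_text, long_text])
print(len(short_text), len(long_text))  # wildly different character counts
print(vectors.shape)                    # (2, 384): both texts become 384-dimensional points
```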

Why?

Because this set of numbers is not a description of the meaning or an encoding of the content, but the coordinates of a point in a vector space. The higher the dimensionality of that space, the longer the sequence of numbers encoding the position of the point. For some advanced language models, this sequence looks frighteningly long (for example, the coordinates of a point in a 3072-dimensional space).

But despite the apparent complexity, these are in fact just coordinates in a high-dimensional Euclidean space, nothing more.

In this vector space, points are grouped locally by semantic meaning: roughly speaking, fears sit near horrors, and delights sit near pleasant things. As a result, the mathematical job of the neural network comes down to finding the closest coordinate points (meanings). The points of the answer must be contextually related both to the points of the query and to the other points of the generated response.

Simply put, the closer the meanings, the closer the points; and the higher the dimensionality of the model's space, the more accurate the semantic analysis of the data.
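Here is a toy illustration of that "closest points, closest meanings" idea: hand-made three-dimensional points stand in for real embeddings (which have thousands of dimensions), and a nearest-neighbor lookup runs on cosine similarity. The vectors are invented purely to show the mechanics.

```python
# Toy illustration of "closest points = closest meanings": hand-made 3-dimensional
# points stand in for real embeddings (which have thousands of dimensions), and a
# nearest-neighbor lookup runs on cosine similarity. The vectors are invented
# purely to show the mechanics.
import numpy as np

points = {
    "fear":     np.array([0.9, 0.1, 0.0]),
    "horror":   np.array([0.8, 0.2, 0.1]),
    "delight":  np.array([0.1, 0.9, 0.2]),
    "pleasure": np.array([0.2, 0.8, 0.3]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means the two points lie in the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(word: str) -> str:
    """Return the other word whose point is closest to `word` in this toy space."""
    return max((w for w in points if w != word),
               key=lambda w: cosine(points[word], points[w]))

print(nearest("fear"))     # horror
print(nearest("delight"))  # pleasure
```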

Combining two different ropes

As a result, this process resembles laying two chaotically tangled lines side by side, each assembled from discrete coordinate points, that is, from meanings.

Try throwing a long, tangled rope on the floor, and then throwing another rope on top (also tangled, but much shorter). No matter how you try to arrange the ropes in parallel, close to each other, only the beginnings and the ends of the two ropes will match well. With the middle you will have inevitable problems, dictated by the geometry of space itself and by the ropes' different entanglement and length (the need to maintain both the contextual coherence of the answer and the contextual binding to the query points).

This means that the AI does not actually forget the middle. Due to a mathematical (or, more precisely, geometric) incompatibility, it simply cannot build a short answer out of interconnected points that are simultaneously close, in coordinates, to the long line of query points.

For the same reason, an LLM answers short questions better and more accurately.
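If you want to see the effect rather than take it on faith, the usual approach is a "needle in a haystack" test: plant a short fact at the start, middle, or end of a long filler document and ask the model to retrieve it. The sketch below is a generic harness; ask_llm is a placeholder for whatever chat API you actually use, and the filler text and code word are made up.

```python
# Generic "needle in a haystack" harness for measuring the forgotten middle:
# plant a short fact at the start, middle, or end of a long filler document and
# ask the model to retrieve it. `ask_llm` is a placeholder stub; replace it with
# a real chat-completion call before drawing any conclusions.
FILLER = "The committee discussed routine administrative matters. " * 2000
NEEDLE = "The secret code word is 'marmalade'."

def ask_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your LLM of choice and return its answer."""
    return ""  # stub: with no real model attached, every check below fails

def recall_at(position: float) -> bool:
    """Insert the needle at a relative position (0.0 = start, 1.0 = end) and test recall."""
    cut = int(len(FILLER) * position)
    document = FILLER[:cut] + NEEDLE + " " + FILLER[cut:]
    answer = ask_llm(document + "\n\nWhat is the secret code word?")
    return "marmalade" in answer.lower()

for pos in (0.0, 0.5, 1.0):
    print(f"needle at {pos:.0%} of the document -> recalled: {recall_at(pos)}")
```

Harnesses of this kind are what produced the widely reported "lost in the middle" curves, in which recall tends to dip for facts placed near the middle of a long context.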

And what about people?

In the human brain, by contrast, good memorization of the beginning and the end of a text was treated as a single phenomenon, the serial position effect, only in the 19th and early 20th centuries. In the second half of the 20th century, scientists established that these are two independent effects with different causes.

The primacy effect, better recall of the beginning of a text (first described by Bennet Murdock in 1962), is explained by the fact that the initial elements of a text receive more attention at the start of reading and are therefore stored in long-term memory.

Moreover, a longer text does not strengthen this effect; on the contrary, it weakens it.

The recency effect, better recall of the end of a text (described in 1966 by Glanzer and Cunitz), is weaker than the primacy effect to begin with and is explained by the characteristics of the brain's short-term memory: its small capacity and its fixation on the most recent elements.

This example shows how coincidentally similar processes can be mistakenly interpreted as evidence of a systemic commonality between AI and the human brain.
