
This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.
Why do we sleep? One obvious reason is to restore the strength of our bodies and limbs. But another very important role of sleep is to consolidate memories and organize all the information the brain has ingested while awake. People who lack proper sleep see their cognitive abilities degrade and their memories fail.
The wonders and mysteries of sleep remain an active area of research. But beyond medicine, psychology, and neuroscience, the study of sleep also serves other fields of science. Artificial intelligence researchers are drawing on this work to develop AI models that handle data more efficiently over longer timespans.
Recent work by AI researchers at DeepMind leverages the study of the brain and the mechanisms of sleep to tackle one of the fundamental challenges of natural language processing (NLP): dealing with long-term memory.
AI’s struggle with language memory
The human brain has a fascinating way of organizing memory. We can manage different lines of thought over a long period. Consider this hypothetical example: You wake up in the morning and spend 45 minutes reading a book about cognitive science. An hour later, you skim the news and read a few articles. In the afternoon, you resume your study of a new AI research paper, which you had started a few days ago, and take notes for a future article. During your daily exercise routine, you listen to a science podcast or an audiobook. And at night, before going to sleep, you open a fantasy novel and pick up where you left off the previous night.
You don’t need to be a genius to do this (it’s roughly what I do every day, and I don’t claim to be smarter than the average person). In fact, most of us handle an even more diverse set of information every day. What’s interesting is that not only is our brain able to preserve and manage these buckets of information, it can do so over a long period: days, weeks, months, and even years.
In recent years, AI algorithms have gradually become better at maintaining consistency over lengthier streams of data, but they still have a long way to go before they can match the skills of the human brain.
The classic machine learning construct for handling language is the recurrent neural network (RNN), a type of artificial neural network designed to deal with temporal consistency in data. An RNN trained on a corpus of data, say a large dataset of Wikipedia articles, can perform tasks such as predicting the next word in a sequence or finding the answer to a question (provided the question is directly answered in the training data).
The problem with earlier versions of RNNs was the amount of memory they needed to handle information: the longer the sequence of data the model could process, the more memory it required. This limit existed mainly because, unlike the human brain, neural networks don’t know which parts of the data they should keep and which parts they can discard.
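To make this concrete, here is a minimal sketch of a recurrent language model in PyTorch. It is an illustrative toy, not any of the systems discussed in this article; the vocabulary size, dimensions, and dummy input are placeholders. The point to notice is that everything the network knows about the past has to be carried in the fixed-size hidden state it passes from step to step.

```python
# A minimal, illustrative recurrent language model (not any system discussed here).
# Vocabulary size, dimensions, and the dummy input are arbitrary placeholders.
import torch
import torch.nn as nn

class TinyRNNLM(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        x = self.embed(tokens)             # (batch, seq_len) -> (batch, seq_len, embed_dim)
        out, hidden = self.rnn(x, hidden)  # the fixed-size hidden state carries the "memory"
        return self.head(out), hidden      # next-word scores at every position

model = TinyRNNLM()
tokens = torch.randint(0, 10_000, (1, 32))   # a dummy 32-token sequence
logits, hidden = model(tokens)
next_word_id = logits[0, -1].argmax()        # predicted id of the next word
```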
Extracting the important bits of information
Consider this: When you read a novel, say The Lord of the Rings, your brain doesn’t memorize all the words and sentences. It is optimized to extract meaningful information from the story, including the characters (e.g., Frodo, Gandalf, Sauron), their relations (e.g., Boromir is Frodo’s friend—almost), locations (e.g., Rivendell, Mordor, Rohan), objects (e.g., The One Ring, Anduril), key events (e.g., the One Ring is destroyed in the fires of Mount Doom, Gandalf falls into the pit of Khazad Dum, the battle of Helm’s Deep), and maybe a few very important lines of dialogue (e.g., all that is gold does not glitter, not all those who wander are lost).
This small amount of information is crucial to being able to follow the story’s plotline across all four books (The Hobbit and the three volumes of The Lord of the Rings) and 576,459 words.
AI scientists and researchers have been trying to find ways to endow neural networks with the same kind of efficient information handling. One great achievement in the field has been the development of “attention” mechanisms, which enable neural networks to find and focus on the most important parts of the data. Attention has enabled neural networks to handle larger amounts of information in a more memory-efficient manner.
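As a rough illustration of what attention computes, here is a sketch of scaled dot-product attention in PyTorch. The shapes are arbitrary, and real Transformer implementations add multiple heads, learned projections, and masking on top of this.

```python
# A bare-bones sketch of scaled dot-product attention; shapes are illustrative.
import torch
import torch.nn.functional as F

def attention(query, keys, values):
    d = query.shape[-1]
    scores = query @ keys.transpose(-2, -1) / d ** 0.5  # how relevant is each stored item?
    weights = F.softmax(scores, dim=-1)                  # focus: weights over the past sum to 1
    return weights @ values                              # weighted mix of the stored values

# Eight new token representations attend over 64 stored ones and pull out
# a weighted summary of the most relevant parts.
q = torch.randn(8, 32)
k = torch.randn(64, 32)
v = torch.randn(64, 32)
context = attention(q, k, v)   # shape: (8, 32)
```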
Transformers, a type of neural network that has become increasingly popular in recent years, have put the attention mechanism to efficient use, allowing AI researchers to create larger and larger language models. Examples include OpenAI’s GPT-2 text generator, trained on 40 gigabytes of text; Google’s Meena chatbot, trained on a 341-gigabyte corpus; and AI2’s Aristo, a deep learning algorithm trained on 300 gigabytes of data to answer science questions.
All these language models show remarkable consistency over longer sequences of text than previous AI algorithms. GPT-2 can frequently (but not always) produce fairly coherent text that spans several paragraphs. Meena has not been released yet, but the sample data that Google has made available show interesting results in conversations that go beyond simple queries. Aristo outperforms other AI models at answering science questions (though it can only answer multiple-choice questions).
However, language-processing AI clearly still has a lot of room for improvement. For the moment, the main drive in the field is to create larger neural networks and feed them ever-bigger datasets. Our brains, by contrast, don’t need (and don’t even have the capacity for) hundreds of gigabytes of data to learn the basics of language.
Drawing inspiration from sleep
When memories are created in our minds, they start as a jumble of sensory and cognitive activities encoded across different parts of the brain. This is the short-term memory. According to neuroscience research, the hippocampus collects activation information from neurons in different parts of the brain and records it in a way that becomes accessible memory. It also stores the cues that will reactivate those memories (a name, smell, sound, sight, etc.). The more a memory is activated, the stronger it becomes.
Now here’s where sleep comes into play. According to Marc Dingman, the author of Your Brain Explained (I strongly recommend reading it), “Studies have found that the same neurons that are turned on during an initial experience are reactivated during deep sleep. This has led neuroscientists to hypothesize that, during sleep, our brains are working to make sure important memories from the previous day get transferred into long-term storage.”
DeepMind’s AI researchers have taken inspiration from sleep to create the Compressive Transformer, a language model that is better suited for long-range memory. “Sleep is known to be crucial for memory, and it’s thought that sleep serves to compress and consolidate memories, thereby improving reasoning abilities for memory tasks. In the Compressive Transformer, granular memories akin to episodic memories are collected online as the model passes over a sequence of inputs; over time, they are eventually compacted,” the researchers write in a blog post that accompanies the full paper on the Compressive Transformer.
Like other variations of the Transformer, the Compressive Transformer uses the attention mechanism to choose relevant bits of data in a sequence. But instead of discarding old memories outright, the model drops the irrelevant parts, condenses what remains into its salient bits, and stores them in a compressed memory.
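To give a feel for how such a scheme might work, here is a rough, illustrative sketch of a two-tier memory. It is a simplification for this article, not DeepMind’s implementation: recent activations sit in a regular memory, and when that memory overflows, the oldest entries are squeezed into a smaller compressed memory instead of being thrown away. Simple average pooling stands in for the learned compression functions explored in the paper.

```python
# A simplified, illustrative two-tier memory (not DeepMind's implementation).
# Average pooling stands in for a learned compression function.
import torch

class CompressedMemory:
    def __init__(self, mem_size=16, comp_mem_size=16, compression_rate=4):
        self.mem_size = mem_size
        self.comp_mem_size = comp_mem_size
        self.c = compression_rate
        self.memory = []       # recent, uncompressed activations (newest last)
        self.compressed = []   # older activations, squeezed down

    def add(self, activations):
        # activations: per-token hidden vectors from the latest segment
        self.memory.extend(activations)
        while len(self.memory) > self.mem_size:
            oldest, self.memory = self.memory[: self.c], self.memory[self.c :]
            self.compressed.append(torch.stack(oldest).mean(dim=0))  # c vectors -> 1
        self.compressed = self.compressed[-self.comp_mem_size :]

    def attend_over(self):
        # in the model, attention would look over both stores
        return self.compressed + self.memory

mem = CompressedMemory()
for _ in range(5):                               # five segments of 8 tokens each
    mem.add([torch.randn(64) for _ in range(8)])
print(len(mem.memory), len(mem.compressed))      # 16 recent slots, 6 compressed ones
```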

According to DeepMind, the Compressive Transformer shows state-of-the-art performance on popular natural language AI benchmarks. “We also show it can be used effectively to model speech, handles rare words especially well, and can be used within a reinforcement learning agent to solve a memory task,” the AI researchers write.
More relevant to long-term memory, however, is the model’s improved performance on modeling long texts. “The model’s conditional samples can be used to write book-like extracts,” DeepMind’s researchers write.
The blog post and paper contain samples of the Compressive Transformer’s output, which are pretty impressive in comparison to other work being done in the field.
Language has not been solved yet
Compression is not the same thing as filing away important components. Let’s go back to the Lord of the Rings example to see what this means. After reading the chapter where the fellowship holds a council at Elrond’s house, you don’t necessarily remember every word of the exchange between the attendees. But you remember one important thing: While everyone is quarreling over how to decide the fate of the One Ring, Frodo steps forth and accepts responsibility for carrying it to Mount Doom. To compress information, then, the mind seems to transform it when storing away memories. And that transformation continues as memories grow older.
Clearly, some sort of pattern recognition enables the Compressive Transformer to find the relevant parts that should be stored in the compressed memory segment. But it remains to be seen whether these bits of data are equivalent to the elements mentioned in the earlier example.
The challenges of using deep learning algorithms to process human language have been well documented. While statistical approaches can find interesting correlations and patterns in large corpora of data, they can’t perform some of the subtler tasks that require knowledge beyond what is contained in the text. Abstraction, common sense, background knowledge, and the other aspects of intelligence that allow us to fill in the blanks and extract the implicit meanings behind words remain unsolved with current approaches in AI.
As computer scientist Melanie Mitchell explains in her book Artificial Intelligence: A Guide for Thinking Humans, “It seems to me to be extremely unlikely that machines could ever reach the level of humans on translation, reading comprehension, and the like by learning exclusively from online data, with essentially no real understanding of the language they process. Language relies on commonsense knowledge and understanding of the world.”
Adding those elements will enable AI models to deal with the uncertainties of language. Cognitive scientist Gary Marcus told me last year, “Except for a few small sentences, almost every sentence you hear is original. You don’t have any data directly on it. And that means you have a problem that is about inference and understanding. The techniques that are good for categorizing things, putting them into bins that you already know, simply aren’t appropriate for that. Understanding language is about connecting what you already know about the world with what other people are trying to do with the words they say.”
In Rebooting AI, Marcus and his co-author, New York University professor Ernest Davis, write, “Statistics are no substitute for real-world understanding. The problem is not just that there is a random error here and there, it is that there is a fundamental mismatch between the kind of statistical analysis that suffice for translation and the cognitive model construction that would be required if systems were to actually comprehend what they are trying to read.”
But compression might help us find new directions in AI and language modeling research. “Models which are able to capture relevant correlations across days, months, or years’ worth of experience are on the horizon. We believe the route to more powerful reasoning over time will emerge from better selective attention of the past, and more effective mechanisms to compress it,” the AI researchers at DeepMind write.