
Most recent advances in artificial intelligence—notably deep learning and more specifically large language models (LLMs)—are largely due to access to the vast amount of data generated by humans and made available online. But that supply of human-generated data is nearly exhausted, and obtaining new training data for models has become one of the major bottlenecks to future advances in AI.
In a new paper (PDF), David Silver and Richard Sutton, two renowned AI scientists, argue that the pace of progress driven solely by supervised learning from human data is slowing and that we need a new approach to create better systems, which they call the “Era of Experience.” Instead of relying on human data, AI systems will gather data from their experience in the world and use it to improve themselves.
There is much to unpack, but before we get to it, it’s worth knowing a bit about the authors and their backgrounds. David Silver, of DeepMind fame, was among the key contributors to milestone projects including AlphaGo, AlphaZero, and AlphaStar, all important achievements in deep reinforcement learning. Just as importantly, he co-authored a 2021 paper arguing that, with a well-designed reward signal and a complex environment, we can achieve artificial general intelligence (AGI).
Richard Sutton, professor of computer science at the University of Alberta, is a pioneer of reinforcement learning and co-author of the seminal book Reinforcement Learning: An Introduction. He was also Silver’s doctoral advisor. In 2019, he wrote a widely cited essay titled “The Bitter Lesson,” in which he argued that real advances in AI will come from “search and learning” at scale rather than from hand-engineered solutions that try to encode human knowledge into AI systems.
Both have been vindicated by the recent explosion in advanced AI systems. Large language models (LLMs) proved that there is a lot of untapped potential in scaling learning by simply increasing the size of models and training datasets. And more recently, models such as DeepSeek-R1 showed that a simple reward signal and reinforcement learning are enough to improve the performance of models on very complex reasoning tasks.
Their new paper builds on the same concepts and presents ideas for unlocking the next stage of advances in AI without being held back by human bottlenecks.
AI experiences
To progress significantly further, they argue, we need a dynamic data source that evolves and improves as AI agents become stronger. “This can be achieved by allowing agents to learn continually from their own experience, i.e., data that is generated by the agent interacting with its environment,” they write.
Silver and Sutton argue that “experience will become the dominant medium of improvement and ultimately dwarf the scale of human data used in today’s systems.”
The authors propose four dimensions along which AI agents break free of the limitations of human-centric design as they gather experience in their environments:
Streams
Current LLMs are focused on short interaction episodes or chat sessions, and little or no information carries over from one episode to the next. This makes it very difficult for AI systems to adapt to changes in their environment over time. As a result, the AI agent is optimized to maximize outcomes within a single episode rather than planning toward long-term goals.
“Powerful agents should have their own stream of experience that progresses, like humans, over a long timescale,” the authors write. “This will allow agents to take actions to achieve future goals, and to continuously adapt over time to new patterns of behaviour.”
For instance, an agent for language learning would track a user’s progress over weeks or months, remember the vocabulary the user learned, grammar points they struggled with, and overall goals. It could then proactively adjust its short- and long-term goals, suggest tailored exercises, or adapt its teaching style based on this rich history.
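To make the idea concrete, here is a minimal Python sketch of what such a persistent stream of experience might look like for that language-learning agent. The class and field names (ExperienceStream, Interaction, weak_topics) are illustrative assumptions, not anything proposed in the paper.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Interaction:
    """One event in the agent's ongoing stream of experience (hypothetical schema)."""
    timestamp: str
    topic: str    # e.g., "subjunctive mood"
    outcome: str  # e.g., "correct" or "struggled"
    note: str = ""


class ExperienceStream:
    """Append-only log that persists across sessions, so the agent can adapt its
    long-term goals instead of optimizing within a single episode."""

    def __init__(self, path: str = "experience_stream.jsonl"):
        self.path = Path(path)

    def record(self, interaction: Interaction) -> None:
        # Append one interaction as a JSON line; the file survives across sessions.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(interaction)) + "\n")

    def history(self) -> list[Interaction]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [Interaction(**json.loads(line)) for line in f]

    def weak_topics(self, threshold: float = 0.6) -> list[str]:
        """Topics the user struggled with at least `threshold` of the time,
        which the agent could use to pick the next exercises."""
        stats: dict[str, list[int]] = {}
        for item in self.history():
            stats.setdefault(item.topic, []).append(1 if item.outcome == "struggled" else 0)
        return [t for t, marks in stats.items() if sum(marks) / len(marks) >= threshold]


# Example: logging today's session and planning tomorrow's exercises
stream = ExperienceStream()
stream.record(Interaction(datetime.now(timezone.utc).isoformat(), "subjunctive mood", "struggled"))
print(stream.weak_topics())
```

The key design point is simply that the log outlives any single chat session, giving the agent a long timescale over which to adjust its goals.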
Actions and observations
Agents will act autonomously in the real world, rather than being limited to human-centered actions and observations.
Recently, AI agents have started interacting with computers using the same interface as humans (e.g., mouse clicks and keystrokes), opening up many new possibilities. “Such agents will be able to actively explore the world, adapt to changing environments, and discover strategies that might never occur to a human,” the authors write.
While these “human-friendly” interactions can help agents autonomously understand and control the digital world, agents are not limited to them. They can also take “machine-friendly” actions such as API calls, and they will interact with the real world through digital interfaces. They will therefore have a range of action and sensing options, favoring the more machine-friendly ones (e.g., wireless communication and API calls) but falling back to legacy options (e.g., physical manipulation and human interfaces) when machine-friendly options are not available.
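As a rough illustration of that preference-with-fallback logic, the Python sketch below uses hypothetical call_api and drive_ui helpers (neither comes from the paper) to show an agent trying a machine-friendly API call first and reverting to simulated clicks and keystrokes only when no structured interface is available.

```python
from typing import Optional


def call_api(endpoint: str, payload: dict) -> Optional[dict]:
    """Hypothetical machine-friendly action: a direct, structured API call.
    Here it simply reports that no API is available, to exercise the fallback."""
    return None


def drive_ui(task: str) -> dict:
    """Hypothetical human-friendly fallback: the agent would operate the same GUI
    a person uses (mouse clicks, keystrokes). Stubbed out for illustration."""
    return {"status": "done-via-ui", "task": task}


def act(task: str, endpoint: str, payload: dict) -> dict:
    """Prefer machine-friendly actions; fall back to legacy human interfaces
    only when no structured interface exists."""
    result = call_api(endpoint, payload)
    if result is not None:
        return result  # fast, structured, machine-friendly path
    return drive_ui(task)  # slower, but works anywhere a human interface exists


print(act("book a flight", "https://example.com/api/bookings", {"from": "AMS", "to": "SFO"}))
```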
Rewards
As agents become able to autonomously observe the world and take action, they will also become capable of designing their own reward signals as opposed to relying on human judgment. This will help them discover new ideas beyond existing human knowledge.
Going back to Silver’s 2021 paper, the authors argue, “There is an argument that even a single such reward signal, optimised with great effectiveness, may be sufficient to induce broadly capable intelligence. This is because the achievement of a simple goal in a complex environment may often require a wide variety of skills to be mastered.”
Agents can design their reward functions from the multitude of measurable signals they can collect from the world (cost, error rates, hunger, productivity, health metrics, climate metrics, profit, sales, exam results, etc.) and adapt them over time as they receive feedback from users and their environment. This can also help agents better understand user goals and ground them in real-world data.
For example, if a user says they want to stay fit, an AI health assistant can combine that high-level goal with other reward signals, such as the user’s vital signs and health metrics, to create a composite reward function. As the user progresses and provides feedback to the agent, it can gradually adjust its reward function with the user’s changing goals, adding new metrics and data or removing those that are no longer relevant.
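Here is a minimal sketch of what such a composite, adaptable reward function could look like in code. The metric names and weights are invented for illustration and are not taken from the paper; in practice the signals would also need to be normalized to comparable scales.

```python
from dataclasses import dataclass, field


@dataclass
class CompositeReward:
    """Illustrative composite reward: a weighted sum of measurable signals
    (hypothetical metrics; negative weights mean lower values are better)."""
    weights: dict[str, float] = field(default_factory=lambda: {
        "resting_heart_rate": -0.4,
        "weekly_active_minutes": 0.5,
        "sleep_hours": 0.3,
    })

    def score(self, metrics: dict[str, float]) -> float:
        # Missing metrics contribute nothing rather than raising an error.
        return sum(w * metrics.get(name, 0.0) for name, w in self.weights.items())

    def update_from_feedback(self, name: str, delta: float) -> None:
        """Nudge a weight up or down as the user's goals shift; add a metric the
        agent starts tracking, or drop one whose weight has decayed to zero."""
        self.weights[name] = self.weights.get(name, 0.0) + delta
        if abs(self.weights[name]) < 1e-9:
            del self.weights[name]


reward = CompositeReward()
print(reward.score({"resting_heart_rate": 62, "weekly_active_minutes": 150, "sleep_hours": 7.5}))
reward.update_from_feedback("weekly_active_minutes", +0.2)  # user now prioritizes exercise
```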
Planning and reasoning
Current reasoning models have been designed to imitate the human thought process, often referred to as “chain of thought” (CoT). However, the authors argue, it is “highly unlikely that human language provides the optimal instance of a universal computer. More efficient mechanisms of thought surely exist, using non-human languages that may for example utilise symbolic, distributed, continuous, or differentiable computations. A self-learning system can in principle discover or improve such approaches by learning how to think from experience.”
AI agents must be grounded in real-world data that provides a feedback loop, allowing the agent to validate its assumptions against reality and discover new principles beyond current modes of human thinking. “Without this grounding, an agent, no matter how sophisticated, will become an echo chamber of existing human knowledge,” the authors write.
This goes back to the ability of agents to observe the world and take actions themselves, as opposed to relying on second-hand experience drawn from human-collected data.
The authors suggest that one possible way to directly ground thinking in the external world is to “build a world model that predicts the consequences of the agent’s actions upon the world, including predicting reward… Given a world model, an agent may apply scalable planning methods that improve the predicted performance of the agent.”
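As a loose illustration of that idea, the sketch below pairs a toy world model with a simple random-shooting planner: it simulates candidate action sequences inside the model and executes the first action of the best-scoring sequence. The toy model, horizon, and scoring here are stand-ins chosen for brevity, not the authors' proposal.

```python
import random
from typing import Callable

# A world model here is any function (state, action) -> (next_state, predicted_reward).
WorldModel = Callable[[float, float], tuple[float, float]]


def toy_model(state: float, action: float) -> tuple[float, float]:
    """Stand-in for a learned model: reward is highest when the state is driven toward 0."""
    next_state = state + action
    return next_state, -abs(next_state)


def plan(model: WorldModel, state: float, horizon: int = 5, candidates: int = 100) -> float:
    """Random-shooting planner: roll out candidate action sequences inside the model
    and return the first action of the sequence with the best predicted return."""
    best_action, best_return = 0.0, float("-inf")
    for _ in range(candidates):
        seq = [random.uniform(-1, 1) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:
            s, r = model(s, a)
            total += r
        if total > best_return:
            best_action, best_return = seq[0], total
    return best_action


print(plan(toy_model, state=3.0))  # should propose an action that pushes the state toward 0
```

The point is not the specific planner, but that any predictive model of action consequences lets the agent improve its behavior by searching inside the model rather than relying solely on human-provided examples.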
What are the challenges of the era of experience?
Previous waves of RL made impressive progress on some tasks (e.g., games such as chess, Go, StarCraft II, and Dota 2) but did not bridge the gap between their simulated environments and the real world (sometimes referred to as the sim-to-real gap).