On Thursday, in a paper published in Science, Alphabet-owned AI research lab DeepMind introduced an artificial intelligence program that can play the first-person shooter game Quake III Arena in Capture the Flag mode with human-level performance.
CTF is a multiplayer mode found in most FPS games, in which players team up to guard their own flag against enemies while trying to capture the enemy team’s flag and bring it back to their base. To compete with humans at Quake III CTF, an AI would have to be capable of fast-paced teamwork and planning, two hallmarks of human intelligence.
AI that plays complex video games is becoming increasingly common. Earlier this year, DeepMind introduced AlphaStar, an AI that defeated human champions at the real-time strategy game StarCraft II. Shortly after, OpenAI’s team of five neural networks outmatched professional Dota 2 players.
But every video game presents new challenges for AI to overcome. And videos posted online show that “For the Win” (FTW), DeepMind’s AI agent, displays interesting gameplay and notable technical innovation.
Why it’s hard for AI to play first-person shooter games
In a way, board games such as chess and Go are much more complicated than first-person shooters like Quake III. They require deep thinking, careful planning and strategizing, and foresight. In contrast, in Quake III, quick reflexes, grabbing the right weapons and powerups, and precision aiming play a greater role. That’s one of the reasons many more people play FPS games than board games.
In multiplayer modes such as CTF, teamwork and planning become more important, but the strategic complexity is still not comparable to a game like Go.
However, all of that is true in human intelligence terms. Viewed from the perspective of available AI technologies, the challenges become very different.
Chess and Go are games of complete information: the AI can see the entire board at once and doesn’t need to guess where its opponents’ pieces are. Board games also have a limited set of legal moves at each given state. This is a situation that is easier for AI to handle, especially because the approach scales with the availability of extra compute resources.
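To see why complete information and a finite move set suit today’s AI, consider exhaustive game-tree search. Below is a minimal sketch in Python on a toy game (a one-pile take-away game, not anything from DeepMind’s paper): because the whole state is visible and every move is enumerable, the search is exact, and extra compute simply lets the same procedure handle deeper trees.

```python
# Minimal sketch: exhaustive game-tree search on a toy complete-information
# game. Players alternately remove 1-3 stones; whoever takes the last stone
# wins. This is an illustration, not code from DeepMind's paper.
from functools import lru_cache

@lru_cache(maxsize=None)
def best_outcome(stones, maximizing=True):
    """Search the full game tree; returns +1 if the maximizer can force a win."""
    if stones == 0:
        # The player who just moved took the last stone and won.
        return -1 if maximizing else 1
    moves = range(1, min(3, stones) + 1)
    outcomes = [best_outcome(stones - m, not maximizing) for m in moves]
    # Complete information makes this exact; more compute means deeper searches.
    return max(outcomes) if maximizing else min(outcomes)

print(best_outcome(10))  # +1: take 2, leaving a multiple of 4 for the opponent
```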
In contrast, Quake is a game of incomplete information. The AI sees only what falls within the player’s field of view. Objects and other players are occluded behind walls, and the AI agent must learn to navigate an unknown map.
Also, at any given state in the game, the AI can choose from a vast number of actions: moving, changing direction, firing weapons, using items, combining several of these at once, or simply doing nothing and waiting.
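A quick back-of-the-envelope calculation shows how fast those choices multiply. The action dimensions and bucket counts below are illustrative guesses, not DeepMind’s actual control scheme:

```python
# Rough illustration of an FPS action space. The dimensions and counts are
# invented for this example; they are NOT DeepMind's actual discretization.
from math import prod

action_dims = {
    "look_left_right": 5,   # discretized turn rates
    "look_up_down": 3,
    "strafe": 3,            # left / none / right
    "move": 3,              # backward / none / forward
    "fire": 2,              # trigger held or released
    "jump": 2,
}

print(prod(action_dims.values()))  # 540 joint actions available on every frame

# Compounded over time, this dwarfs any board game: at 15 decisions per
# second, a one-minute round spans 540 ** 900 possible action sequences.
```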
And unlike chess and Go, where players take turns making moves, Quake is played in real time. This makes the environment even more complex, because the game state changes even as the AI is “thinking” about its next move.
Quake poses a perception problem for AI
Most of the challenges just mentioned have previously been addressed by AlphaStar and OpenAI Five, the AI bots that respectively mastered StarCraft II and Dota 2. Many would agree that those two games present a much greater AI challenge than Quake III: real-time strategy games require both long- and short-term strategy-making, and matches in Dota and StarCraft can last more than an hour, while a round of CTF can be finished in under a minute.
But Quake has its own unique computer vision challenges. Computer vision is the technology that enables software to make sense of the content of images and video. Game-playing AIs use computer vision to process video frames of the game and detect different things such as friendly units, enemy forces, buildings, bridges, rivers, etc.
Currently, the leading technology for computer vision is neural networks and deep learning. Neural networks learn through examples. So, when a neural network sees enough game frames annotated with the types of objects they contain, it becomes able to make sense of new frames it hasn’t seen before and find different objects in the game environment.
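As a rough illustration of that supervised recipe, here is a minimal convolutional classifier in PyTorch that maps game frames to object labels. Everything in it (the architecture, frame size and label set) is a made-up sketch for illustration; DeepMind’s agents actually learn their vision end to end from reward rather than from hand-labeled frames.

```python
# Minimal sketch of a frame classifier: a tiny convolutional network that
# learns to label annotated game frames. Illustrative only; not FTW's model.
import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    def __init__(self, num_classes=4):  # e.g. teammate, enemy, flag, background
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes),
        )

    def forward(self, frames):  # frames: (batch, 3 RGB channels, 84, 84)
        return self.head(self.features(frames))

model = FrameClassifier()
logits = model(torch.randn(8, 3, 84, 84))  # a batch of 8 random "frames"
print(logits.shape)  # torch.Size([8, 4]) -> one score per object class
```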
In real-time strategy games such as StarCraft and Dota, the environment is presented from a fairly consistent angle, which means it is a much easier perception problem to solve. But in an FPS game, the scene changes whenever the AI player turns around. An object or an area can have a totally different pixel structure when viewed from various angles, and the AI has to learn to adapt to all those different views. The same goes for characters. Each character is a rich 3D model composed of hundreds or thousands of triangles, and can look very different from different angles.
Here’s what a real Quake III CTF match looks like.
The challenge becomes evident when you look at the videos of the CTF matches DeepMind’s AI has played. In a blog post published on DeepMind’s website, the company’s engineers explain that the game was “aesthetically modified, though all game mechanics remain the same.” Here’s what it looks like.
As it happens, those “aesthetic modifications” make a huge difference. The players have been changed from rich 3D models to colored spheres, and spheres look the same from every angle, which makes them much easier for the AI to detect. The modified version also removes or tones down other visual effects such as explosions, lighting and shadows.
Now, some might call this cheating, because the game has been altered in favor of the AI. But DeepMind wanted to test whether the AI could learn to work in teams like human players. In this regard, FTW didn’t disappoint its creators.
Reinforcement learning
Like most other game-playing AI models, For the Win learned to play Quake III through deep reinforcement learning. DRL is a subset of machine learning in which the AI model is given the basic actions available in an environment and is left to discover the rules and secrets through trial and error. The AI starts by making random moves and weighs the resulting rewards and penalties. In Quake CTF, for instance, the AI is rewarded for capturing the enemy flag and bringing it back to its own base, and penalized when it gets killed or when enemy players capture its team’s flag.
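Here is a minimal sketch of that trial-and-error loop: tabular Q-learning on an invented, one-dimensional “capture the flag” corridor. The environment, rewards and hyperparameters are all made up for illustration; FTW itself learns from raw pixels with deep networks and a far richer training setup.

```python
# Minimal reinforcement-learning sketch: an agent learns, by trial and error,
# to walk down a corridor, grab the enemy flag, and carry it home.
import random
from collections import defaultdict

N = 6                  # corridor cells: home base at 0, enemy flag at N - 1
ACTIONS = (-1, +1)     # step left or right

def step(state, action):
    """state = (position, has_flag); returns (next_state, reward, done)."""
    pos, has_flag = state
    pos = min(max(pos + action, 0), N - 1)
    if pos == N - 1:
        has_flag = True                     # grabbed the enemy flag
    if pos == 0 and has_flag:
        return (pos, has_flag), 1.0, True   # brought it home: big reward
    return (pos, has_flag), -0.01, False    # small penalty for wasting time

Q = defaultdict(float)                      # (state, action) -> value estimate
alpha, gamma, epsilon = 0.5, 0.95, 0.1

for episode in range(2000):                 # trial and error, thousands of times
    state, done = (0, False), False
    while not done:
        action = (random.choice(ACTIONS) if random.random() < epsilon
                  else max(ACTIONS, key=lambda a: Q[state, a]))
        nxt, reward, done = step(state, action)
        # Nudge the estimate toward the reward plus discounted future value.
        best_next = max(Q[nxt, a] for a in ACTIONS)
        Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])
        state = nxt

# After training, greedy play runs straight to the flag and straight back.
```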
Reinforcement learning mimics some of the functions of the human mind. At the same time, it highlights some of the huge differences between AI and human intelligence.
According to the figures released by DeepMind, it took the AI approximately 100,000 matches to achieve the performance of an average human Quake player. At around 170,000 games, the AI reached “strong human” level, and it eventually played some 450,000 matches.

AI techniques such as reinforcement learning lack the abstract thinking and general problem-solving capabilities of humans. They don’t develop a conceptual model of the game; instead, they rely solely on massive amounts of trial and error to explore the different ways the game can be played. Humans, by contrast, learn to play and master the game with far less training.
Moreover, humans can carry over experience from other FPS games, such as Overwatch, Fortnite or Far Cry. For artificial intelligence, every single one of those games is a totally new experience that it must learn from scratch. In fact, even making small changes such as changing the coloring of the walls or the shape of the players will probably have a dramatic effect on the AI’s performance.
DeepMind also didn’t release any data about the computational resources allocated to train the AI. But if other game-playing AI models are any indication, hundreds of CPUs and GPUs, and hundreds of thousands of dollars’ worth of compute, went into the training.
The human-level performance comparison
Given the fundamental differences between neural networks and the human mind, it’s hard to say whether DeepMind’s AI is manifesting human-level performance. However, on the surface, the AI has developed many interesting behaviors that are reminiscent of the way humans play the game.
According to DeepMind’s data, over the course of the training, the AI manifested behavior that suggested it had developed its own concepts to represent important events in the game, such as when the enemy or a teammate captures a flag.
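One common way researchers test for such internal concepts is a linear probe: record the agent’s hidden activations during play, then check whether a simple linear classifier can read a game event out of them. The sketch below shows the generic recipe with scikit-learn and placeholder data; it is not DeepMind’s exact analysis protocol.

```python
# Generic linear-probe sketch: can a game event (e.g. "my flag was taken")
# be decoded from an agent's hidden state? Data here is random placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5000, 256))    # recorded agent activations
event_labels = rng.integers(0, 2, size=5000)    # 1 if the event was happening

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, event_labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy well above chance (50% here) would suggest the event is linearly
# decodable from the agent's internal state, i.e. the network "represents" it.
print("probe accuracy:", probe.score(X_test, y_test))
```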
The AI agents had also automatically developed what seemed like basic tactics, such as following teammates, setting up a perimeter defense around the home base, and camping in the enemy base.
How did the AI perform against humans?
After developing FTW, DeepMind set up matches in which humans and AI agents played capture the flag both with and against each other. According to DeepMind, the AI agents exceeded the win rate of human players, and human participants rated the agents as more collaborative than their human teammates.
But not everyone thinks that sort of behavior can be considered teamwork or collaboration. According to Mark Riedl, a professor at Georgia Tech College of Computing, who spoke to The New York Times, the AI agents are responding to what is happening in the game, rather than trading messages with one another, as human players do.
Interestingly, one of the match-ups pitted a pair of AI agents against a mixed team of one human and one AI. The all-AI team bested the human/AI pair eight games to one. This means either that the AI team was very good, or that the mixed team performed poorly because human and AI couldn’t cooperate.
But at the end of the day, not every AI needs to mimic the behavior of humans or cooperate with human counterparts. As long as the AI agents learn to cooperate with each other, they can help solve problems in complex environments. For instance, such AI can come in handy in automated warehouse management, where robots must collaborate in a relatively open environment.
Of course, the real world always presents new problems that can’t be found in games. And every change to the environment presents totally new challenges to AI agents trained in the virtual worlds of games. Until those challenges are overcome, we can sit back and enjoy watching the bots kick our butts in our favorite video games.