
This article is part of our coverage of the latest in AI research.
Large language models (LLMs), neural networks trained on huge corpora of text (or other types of data), have become a hot topic of discussion in the artificial intelligence community, especially since a Google engineer claimed that one of the company’s LLMs was sentient.
On the one hand, large language models can perform wonderful feats, generating long sequences of mostly coherent text and creating the impression that they have indeed mastered human language and the skills that underlie it.
On the other hand, numerous experiments show that LLMs merely parrot their training data: their impressive results come from exposure to huge amounts of text, and they break down as soon as they are presented with tasks and problems that require reasoning, common sense, and implicitly learned skills.
But a new study by researchers at the University of Tokyo shows that well-crafted prompts can steer LLMs toward answering questions that require reasoning and step-by-step thinking. The researchers present a method called “zero-shot chain-of-thought” prompting, which uses a special trigger phrase in the prompt to make the LLM go through the steps required to solve a problem. And though simple, the method seems to work well in many cases.
While other studies contest the notion that LLMs can reason, zero-shot CoT shows that, if you know how to query LLMs, they will be better positioned to provide a reasonable answer.
Chain of thought reasoning in LLMs
OpenAI’s GPT-3, a large language model with 175 billion parameters launched in 2020, introduced the idea that LLMs are zero-shot and few-shot learners. This means that a single machine learning model, trained on a very large corpus of text, can be used for many NLP tasks with little or no extra training. In some cases, the model can perform the task out of the box (zero-shot learning). In others, it needs to be conditioned with a few examples embedded in the prompt (few-shot learning).
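To make the distinction concrete, here is a minimal sketch of the two prompting styles. The call_llm() helper is a hypothetical stand-in for a request to an LLM API, and the sentiment-classification task and its wording are illustrative examples, not drawn from the paper.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a request to a large language model API."""
    raise NotImplementedError("Plug in your LLM client here.")

# Zero-shot: the model receives only an instruction, no solved examples.
zero_shot_prompt = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)

# Few-shot: the same task, conditioned with a couple of solved examples
# embedded directly in the prompt.
few_shot_prompt = (
    "Review: I loved every minute of this movie.\nSentiment: positive\n\n"
    "Review: The plot made no sense at all.\nSentiment: negative\n\n"
    "Review: The battery died after two days.\nSentiment:"
)

print(call_llm(zero_shot_prompt))
print(call_llm(few_shot_prompt))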
However, when it comes to tasks that require methodical, step-by-step thinking, LLMs perform very poorly. These tasks are often called “System 2” tasks, named after the dual-process model of thinking popularized by Daniel Kahneman.
To address this shortcoming, researchers have proposed “chain-of-thought” (CoT) prompting, a few-shot prompting technique that shows the LLM how to reason about the problem. In CoT, you prefix the question with an example that has been solved correctly, step by step. For example, let’s say you provide the following zero-shot prompt to an LLM:
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: The answer (arabic numerals) is
The LLM will most likely provide you with the wrong answer because it can’t do the step-by-step calculations that reach the result. However, if you prepend the prompt with an example that shows the reasoning steps that go into solving a similar problem, then the LLM will be conditioned to provide the right output:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A:
(LLM Output) The juggler can juggle 16 balls. Half of the balls are golf balls. So there are 16 / 2 = 8 golf balls. Half of the golf balls are blue. So there are 8 / 2 = 4 blue golf balls. The answer is 4.
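In code, classic few-shot CoT prompting boils down to concatenating a hand-crafted worked example with the new question before sending the text to the model. The sketch below reuses the hypothetical call_llm() helper introduced earlier and the two questions from the example above.

# Few-shot chain-of-thought: prepend a worked example that spells out its
# reasoning, then append the question we actually want answered.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

question = (
    "Q: A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?\n"
    "A:"
)

# The exemplar conditions the model to reason step by step before answering.
print(call_llm(cot_exemplar + question))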
CoT belongs to the larger family of few-shot learning techniques, in which LLMs are primed for their task by including relevant information in the prompt. Chain-of-thought reasoning has become an area of interest in the AI community, and experiments show that the performance of CoT prompting improves as LLMs become larger.
Takeshi Kojima, the lead author of the zero-shot reasoning paper, calls these methods “test-time adaptation,” a range of techniques that help machine learning models correct their output by themselves without the need for fine-tuning. He and his colleagues had previously experimented with test-time adaptation in computer vision.
“We have been interested in the prompting method in LLMs, since it is closely related to the test-time adaptation in the sense that both methods try to improve the performance of pre-trained model during test-time,” Kojima told TechTalks.
In their paper, the researchers note that one of the limits of classic CoT is that the few-shot prompt must be engineered based on the task that the LLM must perform. And experiments show that if the few-shot CoT prompt does not match the task, the performance of the LLM deteriorates considerably.
The researchers wanted to see if it was possible to create a task-agnostic way to trigger CoT reasoning in the LLM without the need to craft full examples in the prompt.
Zero-shot chain of thought reasoning
“In the first place, I was focusing on a sentiment analysis task in NLP, and was thinking that it would improve the accuracy of the task if we appropriately set up an intermediate task (e.g., summarization) before reaching final prediction,” Kojima said.
He discovered that by adding the phrase “Let’s think step by step” to the prompt, he could improve the performance of the model. He later found out that applying the same prompt to many other tasks triggered similar results.
In the paper, the researchers write, “we show that LLMs are decent zero-shot reasoners by adding a simple prompt, Let’s think step by step, to facilitate step-by-step thinking before answering each question.”
Returning to the example above, the zero-shot CoT prompt would look like this:
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let’s think step by step.
(LLM Output) There are 16 balls in total. Half of the balls are golf balls. That means that there are 8 golf balls. Half of the golf balls are blue. That means that there are 4 blue golf balls.
Note that we did not add a manually crafted CoT example. And intriguingly, adding this very simple phrase causes the LLM to go through the same step-by-step problem-solving process that it does when conditioned with the classic CoT prompt.
The researchers describe zero-shot CoT as a versatile and task-agnostic method that can be used in different settings without modification.
“Our method (Zero-shot-CoT) differs from other CoT studies in that it does not need training data nor few-shot exemplars,” Kojima said. “This is a good property because we do not need to prepare carefully hand-crafted data for each task.”
Kojima notes that the performance of zero-shot CoT is not superior to existing CoT approaches, “but it greatly pushes up the baseline performance of zero-shot learning with an easy-to-deploy technique.”
One caveat of using zero-shot CoT properly is that you must prompt the model twice to get the desired output. The first prompt poses the main question along with the zero-shot CoT phrase (“Let’s think step by step”), which causes the model to produce its entire reasoning process. The second prompt instructs the LLM to extract the final answer from that reasoning.
For example, the first prompt might look like this:
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes. How many punches did he throw?
A: Let’s think step by step.
This prompt triggers the CoT mechanism in the LLM, which produces the following result:
In one minute, Joe throws 25 punches.
In three minutes, Joe throws 3 * 25 = 75 punches.
In five rounds, Joe throws 5 * 75 = 375 punches.
Next, the entire prompt and output are fed back to the LLM, along with a phrase that extracts the final answer:
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes. How many punches did he throw?
A: Let’s think step by step.
In one minute, Joe throws 25 punches.
In three minutes, Joe throws 3 * 25 = 75 punches.
In five rounds, Joe throws 5 * 75 = 375 punches.
Therefore, the answer (arabic numerals) is
To which the LLM answers: 375
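Put together, the two-stage procedure can be sketched in a few lines of code. As before, call_llm() is a hypothetical stand-in for an LLM API call; the trigger and answer-extraction phrases follow the wording used above.

# Two-stage zero-shot CoT: (1) elicit the reasoning chain,
# (2) feed it back and extract the final answer.
REASONING_TRIGGER = "A: Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"

def zero_shot_cot(question: str) -> str:
    # Stage 1: ask the model to reason step by step.
    first_prompt = f"Q: {question}\n{REASONING_TRIGGER}"
    reasoning = call_llm(first_prompt)

    # Stage 2: append the reasoning and ask for the answer in the
    # expected format.
    second_prompt = f"{first_prompt}\n{reasoning}\n{ANSWER_TRIGGER}"
    return call_llm(second_prompt)

print(zero_shot_cot(
    "On average Joe throws 25 punches per minute. "
    "A fight lasts 5 rounds of 3 minutes. How many punches did he throw?"
))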

Zero-shot reasoning with large language models
The researchers tested zero-shot CoT on several LLMs and against other techniques. The performance gain in comparison to plain zero-shot prompting was very significant. The performance of the technique was also comparable to the classic few-shot CoT prompting, which requires hand-crafted examples for each prompt.
The researchers noted that even when zero-shot CoT made the wrong predictions, its chain of thought seemed reasonable and flexible.
“Our method may reduce performance for some tasks,” Kojima said. “For example, using Zero-shot-CoT prompting for a task that can be answered intuitively will reduce performance due to overthinking. So, we need to understand the logic behind why certain prompts work well and certain prompts do not. This might lead us to reproducibly/automatically discover high-level prompts that can be used for a variety of tasks other than reasoning (e.g., optimization, imagination, etc.).”
The researchers also tested slightly different zero-shot CoT phrases, such as “Let’s be realistic and think step by step” and “Let’s think like a detective step by step.” Interestingly, the performance was inferior to the original “Let’s think step by step” prompt. Kojima himself was not entirely sure why this difference exists.
“By comparing those three prompts, we can assume that LLMs like to think freely on their own. If we add too much biased direction into the prompt, the model tends to fail to reason. So, the simpler the prompt is, the better reasoning process we can elicit from the models,” Kojima said. “But this is just a hypothesis.”
Are LLMs really reasoning?
A few weeks ago, I reviewed a different study which showed that transformers, the deep learning architecture that underlies LLMs, don’t learn the reasoning functions that underlie their problem space but instead latch onto statistical relations between features, relations that can’t be fully removed from the training examples.
When asked how the findings of that study relate to the results that CoT and zero-shot CoT show, Kojima said the mentioned paper, “On the Paradox of Learning to Reason from Data,” analyzes the ability of fine-tuned models, while it is not obvious if it generally applies to recent LLMs pre-trained on a large text corpus.
“In our paper, we have not analyzed if the prediction performance is highly correlated with the statistical features,” Kojima said. “But it’s one of the interesting study areas, which may affect most of the CoT studies.”
What’s for sure is that large language models have become complex entities that are worthy of their own field of study. LLMs might not yet have learned to understand and speak our language, but we are starting to learn to speak theirs.
“We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting fine-tuning datasets or few-shot exemplars,” Kojima said.