New technique teaches LLMs to optimize their “thought” process

This article is part of our coverage of the latest in AI research.

Large language models (LLMs) with advanced reasoning capabilities, such as OpenAI o1, have drawn attention to a new inference paradigm: giving models time to think. The assumption is that LLMs can solve more complex problems if they are given more compute cycles to generate a thought process about the problem before answering.

However, LLMs often lack the ability to explicitly “think” before responding. To address this challenge, researchers from Meta FAIR, UC Berkeley, and NYU have introduced a new training method that equips LLMs with this thinking ability, enhancing their performance on general instruction following without requiring additional human-labeled data.

The role of thought in instruction following

Humans often think before responding to complex questions or instructions. This internal deliberation allows us to formulate better responses, especially when dealing with challenging or nuanced tasks. While techniques like Chain-of-Thought (CoT) prompting can elicit intermediate reasoning steps from LLMs, their application has been largely limited to math and reasoning tasks. But internal thought can also improve other kinds of tasks, such as planning the structure and characters of a creative writing piece or better interpreting a user's instructions.

“We argue that ‘thinking’ should have broad utility,” the researchers write. “Of course, it is likely that less thinking is required for simpler tasks, and more thinking for more complex ones. In general, we hypothesize that such Thinking LLMs will have an advantage on all sufficiently complex tasks.”

However, training LLMs to think presents a significant challenge due to the lack of supervised training data. Human thought processes are often implicit and not explicitly labeled or included in training corpora. 

Thought Preference Optimization (TPO)

The researchers introduce Thought Preference Optimization (TPO), a training method that guides LLMs to learn and optimize their internal thought processes. 

The idea behind TPO is to train an LLM to generate output in two parts: a "thought" part and a "response" part. The thought part is treated as internal and is generally hidden from the user, as with models such as OpenAI o1. This lets the LLM explore thought processes that might not be of interest to the user. For example, the model can draft multiple responses, evaluate them and correct their mistakes, and reason internally about the nature of the question before responding.
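
To make this format concrete, below is a minimal sketch of how a two-part generation could be prompted and parsed. The prompt wording and the "Response:" delimiter are illustrative assumptions, not the exact strings used in the paper.

```python
# Minimal sketch of the two-part "thought + response" format, assuming a
# simple text delimiter. Prompt wording is illustrative, not from the paper.

THOUGHT_PROMPT = (
    "Respond to the user instruction below. First write out your internal "
    "thoughts (drafts, planning, self-evaluation), then give your final "
    "answer after a line starting with 'Response:'.\n\n"
    "Instruction: {instruction}"
)

def split_thought_and_response(generation: str) -> tuple[str, str]:
    """Split a raw generation into the hidden thought and the user-facing reply."""
    if "Response:" in generation:
        thought, response = generation.split("Response:", 1)
        return thought.strip(), response.strip()
    # If the model skipped the delimiter, treat the whole output as the response.
    return "", generation.strip()
```

Only the second part would be shown to the user; the thought stays internal, as described above.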

Simply prompting a model to reason this way is not enough; the thought and response generation must be optimized. To do so, TPO uses Reinforcement Learning from AI Feedback (RLAIF), a technique in which the LLM's responses are evaluated by a reward or judge model and the feedback is used to fine-tune the model.

Thought Preference Optimization (TPO) (source: arXiv)

In TPO, the judge model evaluates only the quality of the LLM's final response; it never sees the thought part. This approach has several advantages. First, it doesn't require manually curated thought data or a specialized judge capable of evaluating thoughts. Second, it directly optimizes the final objective, generating better responses, rather than relying on an auxiliary objective that might not align well with the overall goal.
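
Putting these pieces together, the sketch below shows one way the judged generations could be turned into training pairs: the judge scores only the response part, while the chosen and rejected examples keep their thoughts so the model learns which thought processes lead to better answers. The `model.generate` and `judge.score` interfaces, the sample count, and the best-vs-worst pairing are assumptions for illustration, not the paper's exact recipe; the block reuses the helpers from the earlier sketch.

```python
# Continues the earlier sketch (reuses THOUGHT_PROMPT and
# split_thought_and_response). `model.generate` and `judge.score` are
# placeholder interfaces, not a real API.

def sample_and_judge(model, judge, instruction: str, k: int = 4) -> list[dict]:
    """Sample k thought+response candidates and score each with the judge.
    The judge sees only the response part, never the thought."""
    candidates = []
    for _ in range(k):
        generation = model.generate(THOUGHT_PROMPT.format(instruction=instruction))
        thought, response = split_thought_and_response(generation)
        score = judge.score(instruction, response)  # thought is hidden from the judge
        candidates.append({"thought": thought, "response": response, "score": score})
    return candidates

def build_preference_pair(instruction: str, candidates: list[dict]) -> dict:
    """Pair the best- and worst-scoring full generations (thought + response)
    as 'chosen' and 'rejected' examples for preference optimization."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    best, worst = ranked[0], ranked[-1]
    return {
        "prompt": instruction,
        "chosen": f"{best['thought']}\nResponse: {best['response']}",
        "rejected": f"{worst['thought']}\nResponse: {worst['response']}",
    }
```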

“Instead of directly guiding the internal thought process, we allow the model to independently learn to think,” the researchers write.

Once TPO has curated enough judged responses, it fine-tunes the model on the resulting preference pairs and uses the updated model to run another iteration of the optimization process. This cycle is repeated until the model reaches the desired performance.
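
The overall cycle might then look like the loop below. Here `preference_finetune` is a stand-in for whatever preference-optimization step updates the model on the chosen/rejected pairs (the method's name suggests a DPO-style objective, but treat that, along with the iteration and sample counts, as assumptions).

```python
# Outer TPO-style loop, continuing the sketch above. `preference_finetune`
# is a placeholder for a DPO-style preference-optimization step; iteration
# and sample counts are arbitrary choices for illustration.

def thinking_llm_training_loop(model, judge, instructions, iterations: int = 4):
    for _ in range(iterations):
        pairs = []
        for instruction in instructions:
            candidates = sample_and_judge(model, judge, instruction, k=4)
            pairs.append(build_preference_pair(instruction, candidates))
        # Fine-tune on the judged pairs, then sample from the updated
        # model in the next iteration.
        model = preference_finetune(model, pairs)
    return model
```

Because each iteration samples from the most recently updated model, the thoughts themselves can improve over time even though they are never directly judged.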

Evaluating Thinking LLMs

The researchers evaluated their method on the AlpacaEval and Arena-Hard benchmarks, which test general instruction following. They used Llama-3-8B-Instruct as the base model for their experiments.

TPO vs baseline LLMs (source: arXiv)

Initially, simply prompting the LLM to generate thoughts decreased performance compared to the baseline model without explicit thinking. However, after several iterations of TPO training, the Thinking LLM caught up with and even surpassed the baseline, demonstrating that the model learned to leverage its internal thoughts effectively.

The Thinking LLM achieved strong win rates on both benchmarks, outperforming the same model trained to respond directly without thoughts. Further analysis showed that thinking benefited not only tasks typically associated with reasoning, such as problem-solving, but also other categories, including general knowledge, marketing, and health.

TPO poem (source: arXiv)

One interesting example highlighted in the paper is poem generation. While not traditionally considered a reasoning task, poem writing can benefit from better planning and understanding of the instructions. This demonstrates the potential of Thinking LLMs to enhance performance in a broader range of tasks.

“This opens up a new opportunity to develop Thinking LLMs aimed at general instruction following rather than specializing in more narrow technical fields,” the researchers write.
