Thinking in graphs improves LLMs’ planning abilities, but challenges remain

Image generated with Microsoft Copilot

This article is part of our coverage of the latest in AI research.

Large language models (LLMs) have shown promise in generating basic planning steps. However, they still struggle with complex tasks that require coordinating and planning multiple actions.

A new study by researchers at the University of Oxford and other institutes examines the ability of LLMs to perform asynchronous planning. Their findings reveal that advanced prompting techniques improve the performance of LLMs in complex tasks. In particular, when instructed to plan their goals in the form of a graph, LLMs become much better at accomplishing them. However, even the most advanced models still struggle with tasks that humans find trivial, especially as the complexity of the task increases.

Asynchronous planning

Asynchronous planning requires more than just generating a sequence of actions. It demands understanding temporal relationships, optimizing for parallel execution, and reasoning under constraints.

For example, baking a cake involves many steps such as preparing the dough, preheating the oven, baking the cake, preparing the frosting, cooling the cake, and adding the frosting. Some of these steps can be done in parallel (e.g., preparing the dough and preheating the oven) while others need to be done sequentially (e.g., baking the cake and adding the frosting).

In their study, the researchers frame the evaluation of LLMs for asynchronous planning broadly: “Assuming infinite resources (e.g., as many agents and tools as needed to achieve optimal parallelism are available) for a naturalistic task with a set of compulsory steps, the time needed for each step, and step ordering constraints,” can the model compute the optimal time needed to complete the task? Asynchronous planning is vital in many downstream tasks, such as developing autonomous robotic agents.
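Under this framing, with unlimited agents the optimal completion time is the longest (critical) path through the task's dependency graph. A minimal sketch using Python's standard-library `graphlib`, with made-up durations for the cake example above (the numbers are illustrative, not from the paper):

```python
from graphlib import TopologicalSorter

# Hypothetical step durations in minutes (illustrative only).
duration = {
    "prepare dough": 20,
    "preheat oven": 15,
    "bake cake": 30,
    "prepare frosting": 10,
    "cool cake": 25,
    "add frosting": 5,
}

# Ordering constraints: step -> set of steps that must finish first.
deps = {
    "bake cake": {"prepare dough", "preheat oven"},
    "cool cake": {"bake cake"},
    "add frosting": {"cool cake", "prepare frosting"},
}

def optimal_time(duration, deps):
    """With unlimited workers, the optimal makespan equals the
    longest path through the dependency DAG."""
    finish = {}
    # Visit steps in topological order so all dependencies are done first.
    for step in TopologicalSorter(deps).static_order():
        ready = max((finish[d] for d in deps.get(step, ())), default=0)
        finish[step] = ready + duration[step]
    return max(finish.values())

print(optimal_time(duration, deps))  # dough (20) + bake (30) + cool (25) + frost (5) = 80
```

Dough preparation and oven preheating overlap, as does preparing the frosting, so the optimal plan takes 80 minutes rather than the 105-minute sequential total.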

LLMs struggle with asynchronous planning because the task requires a combination of skills, including calculating and adding time durations for different steps, comparing durations, and reasoning about ordering constraints.

These skills are often interdependent and require a deeper understanding of the relationships between different actions and their temporal implications, something that LLMs struggle with. 

“This compositionality of skills makes asynchronous planning a challenging task, and it is yet unclear whether LLMs are capable of solving it,” the researchers write.

Asynchronous planning (source: arXiv)

AsyncHow and the limits of LLMs

To evaluate the asynchronous planning capabilities of LLMs, the researchers created Asynchronous WikiHow (AsyncHow), a benchmark with more than 1,600 real-life planning problems. They used AsyncHow to test five popular LLMs: GPT-3.5, GPT-4, Cohere Command, LLaMA-2-70B-chat, and Mistral-7B-Instruct.

The LLMs were tested in four different prompting regimes: zero-shot, few-shot, Chain-of-Thought (CoT), and few-shot CoT.
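The four regimes differ only in whether the prompt includes worked examples and a step-by-step cue. A minimal sketch of how such prompts could be assembled (the wording is illustrative; the paper's exact templates differ):

```python
def build_prompt(task, regime, examples=None):
    """Assemble a prompt in one of four regimes: 'zero-shot', 'few-shot',
    'cot', or 'few-shot-cot'. Illustrative, not the paper's templates."""
    parts = []
    # Few-shot regimes prepend worked question/answer pairs.
    if regime in ("few-shot", "few-shot-cot"):
        for question, answer in examples or []:
            parts.append(f"Q: {question}\nA: {answer}")
    parts.append(f"Q: {task}")
    # CoT regimes add a step-by-step cue to elicit intermediate reasoning.
    if regime in ("cot", "few-shot-cot"):
        parts.append("A: Let's think step by step.")
    else:
        parts.append("A:")
    return "\n\n".join(parts)
```

For example, `build_prompt(task, "few-shot-cot", examples)` yields worked examples followed by the step-by-step cue, the combination that performed best in the study.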

The results show that all models perform poorly when they are not provided with detailed instructions and examples of how to solve the task. Among the tested models, GPT-4 achieves the highest accuracy with few-shot learning examples and step-by-step explanations. But even that leaves much to be desired.

“Even with few-shot illustrations, model performance is unsatisfactory, with the LLMs failing on instances that are trivial for humans,” the researchers write.

Plan Like a Graph (PLaG)

To improve the performance of LLMs on asynchronous planning, the researchers introduced “Plan Like a Graph” (PLaG), a prompting technique that uses graph representations to help LLMs understand the structure of planning problems.

Graphs are a natural way to represent planning problems. They provide a visual representation of the tasks, their dependencies, and their temporal relationships. Several studies show graphs can improve the performance of LLMs on reasoning and planning tasks.

PLaG provides the LLM with few-shot examples of how a task and its solution can be represented as a graph, and how it can use the graph to find the optimal solution.

The researchers tested two variations of PLaG:

Explicit graph: Provides the LLM with a graph representation of the problem and asks it to reason based on the graph.

Build a Graph (BaG): Instructs the LLM to generate its own graph representation of the problem before solving it.
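The difference between the two variants comes down to whether the prompt supplies the dependency graph or asks the model to construct it. A sketch of how the two prompt styles could be built (the wording and edge notation are illustrative, not the paper's exact templates):

```python
def plag_prompt(steps, durations, edges, variant="explicit"):
    """Build a PLaG-style prompt. 'explicit' includes the dependency
    graph as an edge list; 'bag' asks the model to build its own graph.
    Illustrative wording, not the paper's exact templates."""
    lines = [f"Step '{s}' takes {durations[s]} minutes." for s in steps]
    if variant == "explicit":
        # Explicit graph: serialize the DAG into the prompt.
        lines.append("Dependency graph (an edge means 'must finish before'):")
        lines += [f"{a} -> {b}" for a, b in edges]
        lines.append("Using the graph above, compute the shortest total time.")
    else:
        # BaG: instruct the model to construct the graph itself first.
        lines.append("First, rewrite the task as a dependency graph of steps, "
                     "then use that graph to compute the shortest total time.")
    return "\n".join(lines)
```

In the explicit variant the edge list carries the same ordering information as the natural-language description, which is what makes the performance gap the researchers observed notable.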

The results show that both versions of PLaG improve the performance of LLMs on AsyncHow. “By converting naturalistic questions to equivalent graph problems, we find that our method boosts the performance of all tested models,” the researchers write.

PLaG can also be applied to off-the-shelf models such as GPT-4 and improve their performance, the researchers find. 

“This is particularly interesting considering that while natural language prompts in the vanilla setting essentially represent the same information as the graphs, explicitly providing prototypical graph-structured data enhances LLM performance and highlights its inherent limitation of reasoning at a conceptual level,” the researchers write.

PLaG fails to generalize to out-of-distribution examples

While PLaG significantly improves the performance of LLMs on asynchronous planning tasks, the researchers found that even the best-performing models still struggle with complex scenarios. 

Furthermore, the evaluation shows that all the models experience a drastic performance drop when faced with out-of-distribution examples, such as when the number of steps in the task is larger than what they have seen during training. This highlights the limits of the reasoning capabilities of current LLMs.

Despite the limitations, PLaG could be a promising direction for improving the planning capabilities of LLMs. The researchers suggest exploring the integration of PLaG with other techniques, such as reinforcement learning, and adding more elements such as resource constraints and multimodality.
