When Does AI Make Fewer Mistakes in Planning?

Making AI models write out every step and fix verifier-flagged mistakes raises plan validity on rule-bound tasks, an MIT study showed.

Large language models (LLMs) plan better and pass formal checks when they write out each step and fix mistakes flagged by a verifier, new research from MIT CSAIL and Microsoft AI showed.

    MIT CSAIL is the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology, while Microsoft AI is Microsoft’s consumer AI group created in 2024 that works on products like Copilot.

    The paper, “Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning” by Pulkit Verma, Ngoc La, Anthony Favier, Swaroop Mishra, and Julie A. Shah, is an early-stage arXiv preprint from mid-September 2025, so peer review is still ahead.

    The study’s idea is easy to picture. Instead of jumping to a final answer, the model lays out a plan one move at a time, stating what it wants to do, why that move is allowed under the rules, and what the world looks like after the move.

    A software checker then reviews those steps and either gives approval or points to the exact spot that broke the rules. Training repeats this loop so the model stops making the same errors.
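
In code, that loop looks roughly like the sketch below: a minimal, illustrative version in which `generate_plan` stands in for the language model and `validate` for the external checker (both are placeholders, not the paper's implementation).

```python
# Minimal sketch of the write-check-correct loop (illustrative only).
# `generate_plan` stands in for the LLM and `validate` for the external
# plan checker; both are placeholders supplied by the caller.

def plan_with_feedback(task, generate_plan, validate, max_rounds=3):
    """Ask the model for a plan, have the checker review it, and feed the
    checker's error message back into the next attempt."""
    feedback = None
    for _ in range(max_rounds):
        plan = generate_plan(task, feedback)   # model writes a step-by-step plan
        ok, message = validate(task, plan)     # checker approves or names the break
        if ok:
            return plan                        # every step was legal
        feedback = message                     # concrete error to fix next round
    return None                                # no valid plan within the budget
```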

In the paper, the checker is a standard validator from the planning community and the plans are written in the Planning Domain Definition Language (PDDL), a precise way to describe actions, their preconditions, and their effects.
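
For readers who have not seen it, the snippet below shows the textbook Blocksworld "pick-up" action written in PDDL; it is the standard domain definition rather than anything taken from the paper, and it is wrapped in a Python string only to keep all examples in one language.

```python
# The textbook Blocksworld "pick-up" action in PDDL (standard domain text,
# not reproduced from the paper). The precondition says when the move is
# legal; the effect says what becomes true or false afterwards.
PICKUP_PDDL = """
(:action pick-up
  :parameters (?x - block)
  :precondition (and (clear ?x) (ontable ?x) (handempty))
  :effect (and (not (ontable ?x))
               (not (clear ?x))
               (not (handempty))
               (holding ?x)))
"""
```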

    The tasks are classic, rule-tight puzzles that stand in for real workflows. You start in one state, you can take only certain actions when conditions are met, and you must reach a goal without breaking any rules. Big chat models are good at sounding confident, but when every step must be legal they often stumble.
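
A toy, executable version of that rule-checking makes the idea concrete. The sketch below (not the paper's code, and using invented fact names) represents a state as a set of facts, fires an action only when its preconditions hold, and checks the goal at the end.

```python
# A tiny Blocksworld-style illustration of "legal at every step, then reach
# the goal" (illustrative only; the paper uses PDDL domains and an external
# validator).

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset      # facts that must hold before the action
    add: frozenset      # facts the action makes true
    delete: frozenset   # facts the action makes false

def validate_plan(state, plan, goal):
    """Check each step's preconditions, apply its effects, then check the goal.
    Returns (True, "") or (False, a message naming the exact violation)."""
    for i, act in enumerate(plan, start=1):
        missing = act.pre - state
        if missing:
            return False, f"step {i}, {act.name}: unmet {sorted(missing)}"
        state = (state - act.delete) | act.add
    if goal - state:
        return False, f"goal not reached: missing {sorted(goal - state)}"
    return True, ""

# Two-block example: pick up A, then stack it on B.
pickup_a = Action("pickup(A)",
                  frozenset({"clear(A)", "ontable(A)", "handempty"}),
                  frozenset({"holding(A)"}),
                  frozenset({"clear(A)", "ontable(A)", "handempty"}))
stack_a_b = Action("stack(A,B)",
                   frozenset({"holding(A)", "clear(B)"}),
                   frozenset({"on(A,B)", "clear(A)", "handempty"}),
                   frozenset({"holding(A)", "clear(B)"}))

start = frozenset({"ontable(A)", "ontable(B)", "clear(A)", "clear(B)", "handempty"})
goal = frozenset({"on(A,B)"})
print(validate_plan(start, [pickup_a, stack_a_b], goal))  # (True, '')
print(validate_plan(start, [stack_a_b], goal))            # fails: holding(A) missing
```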

    By forcing the model to show its work and by having a separate program check each step, the authors reported a sharp jump in how often the plans are valid.

    The numbers are the attention grabber. Using Llama-3 as the base, the method produces valid plans 94% of the time on a well-known puzzle called Blocksworld, 79% on a routing task called Logistics, and 64% on a trickier variant called Mystery Blocksworld.

    Similar runs with GPT-4 moved in the same direction, though the authors said those were limited by access.

    Across the board, the biggest gains come when the checker does more than say pass or fail and instead tells the model exactly which condition it violated, because that gives the model a concrete error to fix next time.
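
The difference is easy to see in how the next prompt gets built. The sketch below uses invented prompt wording to show why a named violation is more useful than a bare pass/fail verdict.

```python
# Why detailed feedback helps (prompt wording invented for illustration).
# A binary verdict gives the model nothing concrete to repair; a precise
# message can be pasted straight into the next attempt.

def repair_prompt(task_description: str, failed_plan: str, feedback: str) -> str:
    return (
        f"{task_description}\n\n"
        f"Your previous plan was rejected:\n{failed_plan}\n\n"
        f"The checker reported: {feedback}\n"
        "Write a corrected plan and state the world after every action."
    )

# Detailed: feedback = "step 3, pickup(B): precondition clear(B) is false"
# Binary:   feedback = "invalid"  -- the model must guess which rule it broke
```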

Training happens in two passes. First, the model is shown many worked examples, including deliberately flawed plans, so it learns to spot where a step goes wrong.

    Second, it must write out every state change between actions, and those traces are sent to the checker for feedback. That is the whole trick: write down each step in plain, machine-readable form, check it, correct it, and try again.
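
A rough sketch of how those two passes could be organized as data pipelines is shown below; the field names and the `plan_with_states` and `validate` helpers are assumptions for illustration, not the paper's code.

```python
# Hedged sketch of the two training passes (names and helpers are assumptions).

def phase1_examples(valid_plans, flawed_plans):
    """Pass 1: worked examples, including deliberately broken plans annotated
    with where the first illegal step occurs, so the model learns to spot it."""
    data = [{"task": t, "plan": p, "label": "valid"} for t, p in valid_plans]
    data += [{"task": t, "plan": p, "label": f"invalid at step {s}: {why}"}
             for t, p, s, why in flawed_plans]
    return data

def phase2_examples(model, checker, tasks):
    """Pass 2: the model writes out every intermediate state; the external
    checker turns each trace into feedback used for further tuning."""
    data = []
    for task in tasks:
        trace = model.plan_with_states(task)          # hypothetical model call
        ok, message = checker.validate(task, trace)   # hypothetical checker call
        data.append({"task": task, "trace": trace,
                     "feedback": "valid" if ok else message})
    return data
```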

There are important limits, and the paper is clear about them. The goal is a correct plan, not the shortest plan, so the system aims for “good-enough” solutions that reach the goal rather than the perfect route.

    The experiments also used a simpler slice of the formal language, without advanced features like actions that take time or effects that depend on hidden conditions. Most importantly, the accuracy comes from using an external checker on purpose, because asking the model to police its own logic is still unreliable on strict rule-following tasks.

    All of this matters for how you read the headline figure. Ninety-four percent on Blocksworld is strong, but Blocksworld is the friendliest of the three puzzles, and the scores drop as problems get trickier.

    The authors tried to avoid overfitting by keeping separate data for each training phase and for the final test, and by mixing in flawed examples to sharpen the model’s eye for errors, but these are still clean, textbook domains rather than messy, real-world systems.

    Even with those caveats, the direction is useful for people who want AI to carry out multi-step tasks that must be correct. If you can describe your task precisely, you can make the model spell out each move and have a checker confirm it.

    That will be slower than free-form generation, and it may not find the shortest route, but the tradeoff is reliability.

    Think warehouse routing where a picker’s path must respect aisle rules and weight limits, robotic picking where a gripper must not collide or lift a blocked item, software runbooks that roll out updates server by server with health checks between steps, or desktop automations that log into sites, download reports, and fill forms without skipping required fields.

    In all of these, a single illegal move can ruin the whole run, so getting more valid plans is a step toward tools you can trust.
