Automatic Reasoning and Tool-use (ART)

Large Language Models are good at thinking. They’re also good at acting — if you give them the right tools and tell them when to use them.
The problem: most tool-using LLM pipelines today rely on hand-crafted prompts and static scripts that hardcode when to call a calculator, a search engine, or a code executor. They’re brittle. They don’t adapt.

ART (Automatic Reasoning and Tool-use) is a different approach:

  • Give the LLM a library of example tasks that show reasoning + tool calls.
  • Freeze the model — no retraining — and let it plan solutions step-by-step.
  • Whenever a tool call appears, pause the LLM, run the tool, and feed the output back into the conversation before continuing.
  • Optionally, let humans fix mistakes or add tools without retraining the model.

It’s program synthesis meets orchestration.


🚦 The ART Loop

  1. Select Examples – Pull relevant reasoning + tool-use demonstrations from the task library (a selection sketch follows this list).
  2. Run Program – Generate reasoning steps; pause to call tools as needed.
  3. Fix Mistakes (Optional) – Allow human edits or new tools to be added dynamically.
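
Step 1 is the part the demo below glosses over, so here is a minimal sketch of example selection, assuming a plain keyword-overlap score (select_examples is an illustrative helper, not the paper's exact retrieval method; a real system would more likely use embedding similarity):

# Rank task-library entries by word overlap with the new task and keep the top k.
def select_examples(task: str, library: list[dict], k: int = 2) -> list[dict]:
    task_words = set(task.lower().split())
    def overlap(entry: dict) -> int:
        return len(task_words & set(entry["task"].lower().split()))
    # Highest-overlap entries first; sorted() is stable, so zero overlap keeps library order.
    return sorted(library, key=overlap, reverse=True)[:k]

The demo's library has a single entry, so it simply includes everything; with a larger library you would call select_examples before building the prompt.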

Code Demo: Mini-ART with Python

Below is a minimal simulation of ART in Python. We’ll give the LLM a math problem, let it decide to use a calculator, and resume reasoning after getting the tool’s output.

from openai import OpenAI

# OpenAI Python SDK v1.x style client
client = OpenAI(api_key="your-api-key")

# Define tool functions (eval is handy for a demo, but unsafe on untrusted input)
def calculator(expression: str) -> str:
    try:
        return str(eval(expression))
    except Exception as e:
        return f"Error: {e}"

# Example Task Library
task_library = [
    {
        "task": "Math: Calculate sum",
        "steps": [
            "Q1: [reason] Identify the numbers to add.",
            "Q2: [tool:calculator] 2 + 2",
            "Q3: [reason] State the result."
        ]
    }
]

# Input problem
new_task = "What is 17 * 24 plus 10?"

# Prompt the model, using the task library as few-shot demonstrations
system_prompt = """You are an assistant that solves problems step-by-step.
If a step requires calculation, output a line of the form: [tool:calculator] <expression>
Resume reasoning after receiving the tool's output."""

demonstrations = "\n\n".join(
    f"{t['task']}\n" + "\n".join(t["steps"]) for t in task_library
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Examples:\n{demonstrations}\n\nTask: {new_task}"}
    ],
    max_tokens=150
)

plan = response.choices[0].message.content
steps = plan.split("\n")

# Simulate ART execution loop: run tool lines, print reasoning lines
for step in steps:
    if "[tool:calculator]" in step:
        expression = step.split("[tool:calculator]", 1)[1].strip()
        tool_result = calculator(expression)
        print(f"🔧 Tool Output: {tool_result}")
    else:
        print(f"🤖 LLM: {step}")
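
The loop above prints the tool output but never hands it back to the model. One minimal way to close the loop, reusing the client and prompt variables defined above (this continuation is a sketch, not part of the original demo), is to append the partial plan and the tool result to the conversation and ask the model to keep going:

# Resume reasoning after the tool call (assumes the plan contained at least
# one [tool:calculator] line, so tool_result is defined).
followup = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Examples:\n{demonstrations}\n\nTask: {new_task}"},
        {"role": "assistant", "content": plan},
        {"role": "user", "content": f"[tool output] {tool_result}\nContinue the remaining steps."}
    ],
    max_tokens=150
)
print(f"🤖 LLM (resumed): {followup.choices[0].message.content}")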

What This Code Demonstrates

  • Dynamic Tool Use: The model decides when to call a tool.
  • Interleaved Execution: LLM pauses for tool results before continuing.
  • Extensibility: Adding a new tool is just a matter of defining a function and adding a demonstration of it to the task library, as sketched below.
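
For instance, here is a hypothetical extension that registers a date tool next to the calculator and adds a matching demonstration; the today function, the tools registry, and run_step are illustrative names, not part of the demo above:

from datetime import date

# A second tool, plus a registry mapping tag names to callables
def today(_: str = "") -> str:
    return date.today().isoformat()

tools = {"calculator": calculator, "today": today}

# "Teach" the new tool by adding a demonstration to the library
task_library.append({
    "task": "Date: Report today's date",
    "steps": [
        "Q1: [reason] The question asks for the current date.",
        "Q2: [tool:today]",
        "Q3: [reason] State the date returned by the tool."
    ]
})

# Generic dispatch: execute whichever registered tool a step mentions
def run_step(step: str) -> str:
    for name, fn in tools.items():
        tag = f"[tool:{name}]"
        if tag in step:
            return f"🔧 {name}: {fn(step.split(tag, 1)[1].strip())}"
    return f"🤖 LLM: {step}"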

Why This Matters

On unseen tasks from the BigBench and MMLU benchmarks, ART outperforms:

  • Standard few-shot prompting
  • Automatic Chain-of-Thought (CoT) prompting
  • Even hand-crafted CoT prompts in many cases, especially when paired with human feedback.

For complex, multi-step reasoning tasks (math, code generation, multi-hop search), this method moves us closer to agents that adapt rather than just follow a script.