Large Language Models are good at thinking. They’re also good at acting — if you give them the right tools and tell them when to use them.
The problem: most tool-using LLM pipelines today rely on hand-crafted prompts and static scripts that hardcode when to call a calculator, a search engine, or a code executor. They’re brittle. They don’t adapt.
ART (Automatic Reasoning and Tool-use) is a different approach:
- Give the LLM a library of example tasks that show reasoning + tool calls.
- Freeze the model — no retraining — and let it plan solutions step-by-step.
- Whenever a tool call appears, pause the LLM, run the tool, and feed the output back into the conversation before continuing.
- Optionally, let humans fix mistakes or add tools without retraining the model.
It’s program synthesis meets orchestration.
🚦 The ART Loop
- Select Examples – Pull relevant reasoning + tool-use demonstrations from the task library (a minimal selection sketch follows this list).
- Run Program – Generate reasoning steps; pause to call tools as needed.
- Fix Mistakes (Optional) – Allow human edits or new tools to be added dynamically.
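Step 1 is plain retrieval, which the mini demo below skips. Here is a minimal sketch, assuming the same `task_library` format used in the demo and a naive keyword-overlap score as a stand-in for the similarity-based selection the paper describes; `select_examples` and `format_demos` are illustrative names, not library functions.

```python
def select_examples(task: str, library: list, k: int = 2) -> list:
    """Rank library entries by naive keyword overlap with the new task
    (a stand-in for proper similarity-based retrieval)."""
    task_words = set(task.lower().split())

    def score(entry):
        entry_words = set((entry["task"] + " " + " ".join(entry["steps"])).lower().split())
        return len(task_words & entry_words)

    return sorted(library, key=score, reverse=True)[:k]

def format_demos(demos: list) -> str:
    """Turn selected demonstrations into prompt text the frozen model can imitate."""
    return "\n\n".join(d["task"] + "\n" + "\n".join(d["steps"]) for d in demos)
```

The formatted demonstrations would be appended to the system prompt before the model is called, so the frozen model sees worked examples of the `[reason]` / `[tool:...]` step format it is expected to produce.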
Code Demo: Mini-ART with Python
Below is a minimal simulation of ART in Python. We’ll give the LLM a math problem, let it decide to use a calculator, and resume reasoning after getting the tool’s output. (The demo uses the legacy, pre-1.0 OpenAI Python SDK.)
```python
import openai

openai.api_key = "your-api-key"

# Define tool functions
def calculator(expression: str) -> str:
    # eval() keeps the demo short; don't use it on untrusted input in real code
    try:
        return str(eval(expression))
    except Exception as e:
        return f"Error: {e}"

# Example task library (not wired into the prompt in this mini demo; see the selection sketch above)
task_library = [
    {
        "task": "Math: Calculate sum",
        "steps": [
            "Q1: [reason] Identify the numbers to add.",
            "Q2: [tool:calculator] 2 + 2",
            "Q3: [reason] State the result."
        ]
    }
]

# Input problem
new_task = "What is 17 * 24 plus 10?"

# Prompt the model
system_prompt = """You are an assistant that solves problems step-by-step.
If a step requires calculation, output: [tool:calculator] <expression>
Resume reasoning after receiving the tool's output."""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Task: {new_task}"}
    ],
    max_tokens=150
)

steps = response["choices"][0]["message"]["content"].split("\n")

# Simulate the ART execution loop: emit reasoning steps, pausing to run tools when requested
for step in steps:
    if "[tool:calculator]" in step:
        expression = step.split("[tool:calculator]", 1)[1].strip()
        tool_result = calculator(expression)
        print(f"🔧 Tool Output: {tool_result}")
    else:
        print(f"🤖 LLM: {step}")
```
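The loop above only prints the tool output; in ART the output is spliced back into the conversation so the frozen model can resume reasoning from it. Here is a minimal sketch of that resume step, reusing the demo's `openai` import and legacy-SDK call; the message wording and the `resume_after_tool` helper are assumptions for illustration, not a fixed ART format.

```python
def resume_after_tool(messages: list, partial_reasoning: str, tool_result: str) -> str:
    """Feed the tool output back to the model and ask it to continue where it paused."""
    continued = messages + [
        {"role": "assistant", "content": partial_reasoning},
        {"role": "user", "content": f"Tool output: {tool_result}\nContinue reasoning from this result."},
    ]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=continued,
        max_tokens=150,
    )
    return response["choices"][0]["message"]["content"]
```

In the execution loop, you would call this right after `calculator(expression)`, passing the original messages and the steps generated so far, then scan the continuation for further tool calls in the same way.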
What This Code Demonstrates
- Dynamic Tool Use: The model decides when to call a tool.
- Interleaved Execution: LLM pauses for tool results before continuing.
- Extensibility: Adding a new tool is just a matter of defining a function and adding a demonstration of it to the task library (see the sketch below).
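As an example of that extensibility, here is a minimal sketch of a tool registry under the demo's `[tool:<name>]` convention. The `search` entry is a stub rather than a real search API, and `TOOLS`, `TOOL_PATTERN`, and `run_step` are illustrative names.

```python
import re

# Tool registry: adding a tool = one new entry here plus a demonstration in the task library
TOOLS = {
    "calculator": calculator,  # defined in the demo above
    "search": lambda query: f"(stub) top result for: {query}",  # placeholder, not a real search API
}

TOOL_PATTERN = re.compile(r"\[tool:(\w+)\]\s*(.*)")

def run_step(step: str) -> str:
    """Dispatch one generated step: call the named tool if the step requests one,
    otherwise pass the reasoning through unchanged."""
    match = TOOL_PATTERN.search(step)
    if match and match.group(1) in TOOLS:
        name, argument = match.groups()
        return f"🔧 {name}: {TOOLS[name](argument)}"
    return f"🤖 LLM: {step}"
```

The demo's loop body then collapses to `print(run_step(step))`, and new tools never require touching the model itself.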
Why This Matters
On the BigBench and MMLU tasks used in the ART paper, it can outperform:
- Standard few-shot prompting
- Automatic Chain-of-Thought (CoT)
- Even hand-crafted CoT in some cases — especially when paired with human feedback.
For complex, multi-step reasoning tasks (math, code generation, multi-hop search), this method moves us closer to agents that adapt rather than just follow a script.