The Inference-Time Compute Revolution
In early large language models, improvement followed a simple rule:
train bigger models on more data with more compute
Intelligence was scaled almost entirely at training time.
That still matters—but modern reasoning models introduce a second axis:
scaling at inference time
This shift fundamentally changes how we design systems, especially agents.
From Instant Answers to Deliberate Reasoning
Classic LLMs behave like fast predictors:
prompt → answer
They generate the most likely continuation immediately.
This is powerful—but flawed. The model can:
- commit too early
- miss edge cases
- fail to revise its approach
Reasoning models change the flow:
prompt → internal reasoning → answer
Instead of answering immediately, they:
- break down the problem
- explore alternatives
- check consistency
- refine their plan
The key difference is simple:
they spend compute before answering
What “Thinking Before Speaking” Really Means
“Thinking before speaking” is a useful phrase—but slightly misleading.
These models are still neural networks—not symbolic reasoners.
What changes is:
they are allowed to use more tokens for intermediate reasoning before producing the final answer
Different systems expose this differently:
- hidden reasoning (OpenAI-style)
- visible thinking blocks (Anthropic)
- explicit reasoning traces (DeepSeek)
The interface varies. The principle does not.
A Useful Mental Model
Think of the model as writing a scratchpad internally:
User: What is 17 × 24?
[internal reasoning]
17 × 24 = 17 × 20 + 17 × 4
        = 340 + 68
        = 408
Final answer: 408
You may not see this reasoning, but it still consumes compute.
Reasoning tokens are real: they cost latency, money, and budget.
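To make the cost of reasoning tokens concrete, here is a minimal sketch of a per-request cost calculation. The prices and token counts are hypothetical assumptions, not real API pricing; the point is that hidden reasoning tokens are typically billed like output tokens.

```python
def request_cost(prompt_tokens, reasoning_tokens, output_tokens,
                 price_in=3.0, price_out=12.0):
    """Cost in dollars. Prices are per million tokens (hypothetical).

    Reasoning tokens are billed as output tokens even when hidden.
    """
    billed_output = reasoning_tokens + output_tokens
    return (prompt_tokens * price_in + billed_output * price_out) / 1_000_000

# The same prompt, with and without deep reasoning:
shallow = request_cost(prompt_tokens=200, reasoning_tokens=0, output_tokens=300)
deep = request_cost(prompt_tokens=200, reasoning_tokens=8_000, output_tokens=300)
print(f"shallow: ${shallow:.4f}, deep: ${deep:.4f}")
```

Even with made-up prices, the shape of the tradeoff is clear: a few thousand reasoning tokens can dominate the bill for a single call.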
The Tradeoff: Depth vs Cost
More reasoning → better answers
But also → slower and more expensive
This leads to a core design lever:
| Reasoning Depth | Speed | Cost | When to Use |
|---|---|---|---|
| Low | Fast | Low | Simple tasks |
| Medium | Moderate | Medium | Multi-step reasoning |
| High | Slow | High | Planning, math, code |
Do not pay for thinking when the task is trivial.
Do pay for it when mistakes are expensive.
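The table above can be treated as a policy. Here is a minimal sketch of that lever as code; the task categories and effort levels are illustrative assumptions, not a fixed taxonomy.

```python
# Map task categories to reasoning depth. Categories are illustrative.
EFFORT_BY_TASK = {
    "extraction": "low",      # simple tasks: fast and cheap
    "formatting": "low",
    "multi_step": "medium",   # moderate reasoning
    "planning": "high",       # mistakes are expensive: pay for thinking
    "math": "high",
    "code_review": "high",
}

def reasoning_effort(task_type: str) -> str:
    # Default to low: do not pay for thinking on trivial tasks.
    return EFFORT_BY_TASK.get(task_type, "low")

print(reasoning_effort("planning"))
print(reasoning_effort("formatting"))
```

In a real system this lookup would feed directly into the model call's reasoning-depth parameter.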
Why This Changes Agent Design
Previously, agents had to externalize reasoning:
observe → reason → plan → act → reflect
Most of the intelligence lived outside the model, in loops.
Now, part of that loop moves inside the model call.
| Phase | Earlier Agents | Reasoning Models |
|---|---|---|
| Reason | External loops | Internalized |
| Plan | Explicit prompting | Often internal |
| Reflect | Separate steps | Sometimes internal |
This leads to a key shift:
fewer outer loops, more powerful inner calls
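The shift can be sketched as two shapes of the same agent. `call_model` below is a placeholder stub, not a real API client; it only illustrates where the loop lives.

```python
def call_model(prompt, effort="low"):
    # Stub standing in for any LLM client call.
    return f"[{effort}] answer to: {prompt}"

# Classic agent: intelligence lives in the outer loop.
def classic_agent(task, max_steps=3):
    plan = call_model(f"Plan: {task}")
    for _ in range(max_steps):
        action = call_model(f"Next action for: {plan}")
        plan = call_model(f"Reflect on: {action}")
    return plan

# Reasoning-model agent: one powerful inner call.
def reasoning_agent(task):
    return call_model(f"Plan, act, and verify: {task}", effort="high")
```

The classic version makes many cheap calls; the reasoning version makes one expensive call that internalizes planning and reflection.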
The Core Design Choice
You now have two ways to build intelligence:
1. Externalize it (classic agents)
- more loops
- more orchestration
- cheaper models
- more control
2. Internalize it (reasoning models)
- fewer calls
- deeper thinking per call
- higher cost per step
- simpler orchestration
Neither is universally better.
Fast models are best for:
- routing
- formatting
- extraction
- simple tool use
Reasoning models are best for:
- planning
- code reasoning
- math and logic
- evaluation and critique
The Practical Rule
Don’t ask:
“Should I use a reasoning model?”
Ask:
“Where is a wrong decision expensive?”
Use reasoning models there.
Examples:
- choosing a multi-step plan
- deciding if evidence is sufficient
- reviewing code before execution
- evaluating outputs
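The practical rule above can be sketched as routing by the estimated cost of a mistake rather than by task label. The threshold and tier names are illustrative assumptions.

```python
def pick_model(error_cost: float, threshold: float = 10.0) -> str:
    """Return a model tier given the estimated dollar cost of a mistake."""
    return "reasoning" if error_cost >= threshold else "fast"

print(pick_model(error_cost=0.5))    # formatting a string
print(pick_model(error_cost=500.0))  # reviewing code before execution
```

In practice the threshold depends on your latency budget and per-call pricing, but the decision rule stays the same.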
The Bigger Picture
The old scaling story was:
- bigger models
- more data
- better training
The new story adds:
- more inference compute
- deeper reasoning
- better performance on hard tasks
Intelligence is no longer only trained; it is also computed at runtime.
```python
# This code sample is not tested, and is provided just as reference
# Only Gemini and Ollama based code on this portal are tested
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1",
    messages=[
        {
            "role": "user",
            "content": "Design a fault-tolerant multi-agent system."
        }
    ],
    reasoning_effort="high",  # increase internal reasoning depth
)

print(response.choices[0].message.content)

# Includes hidden reasoning tokens
print(response.usage)
```

```rust
// This code sample is not tested, and is provided just as reference
// Only Gemini and Ollama based code on this portal are tested
use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    let body = json!({
        "model": "o1",
        "messages": [{
            "role": "user",
            "content": "Design a fault-tolerant multi-agent system."
        }],
        "reasoning_effort": "high"
    });

    let response = client
        .post("https://api.openai.com/v1/chat/completions")
        .header("Authorization", format!("Bearer {}", std::env::var("OPENAI_API_KEY")?))
        .header("Content-Type", "application/json")
        .json(&body)
        .send()
        .await?
        .json::<serde_json::Value>()
        .await?;

    let msg = &response["choices"][0]["message"]["content"];
    println!("=== Answer ===\n{}", msg);

    Ok(())
}
```
Closing
Inference-time compute doesn’t replace agents.
It changes them.
Some reasoning moves inside the model. The rest still lives in your system.
The real skill now is deciding where each piece belongs.
→ Next up: LLM Basics and Primitives for Agents