The Inference-Time Compute Revolution
In early large language models, improvement followed a simple rule:
train bigger models on more data with more compute
Intelligence was scaled almost entirely at training time.
That still matters—but modern reasoning models introduce a second axis:
scaling at inference time
This shift fundamentally changes how we design systems, especially agents.
From Instant Answers to Deliberate Reasoning
Classic LLMs behave like fast predictors:
prompt → answer
They generate the most likely continuation immediately.
This is powerful—but flawed. The model can:
- commit too early
- miss edge cases
- fail to revise its approach
Reasoning models change the flow:
prompt → internal reasoning → answer
Instead of answering immediately, they:
- break down the problem
- explore alternatives
- check consistency
- refine their plan
The key difference is simple:
they spend compute before answering
What “Thinking Before Speaking” Really Means
“Thinking before speaking” is a useful phrase—but slightly misleading.
These models are still neural networks—not symbolic reasoners.
What changes is:
they are allowed to use more tokens for intermediate reasoning before producing the final answer
Different systems expose this differently:
- hidden reasoning (OpenAI-style)
- visible thinking blocks (Anthropic)
- explicit reasoning traces (DeepSeek)
The interface varies. The principle does not.
A Useful Mental Model
Think of the model as writing a scratchpad internally:
User: What is 17 × 24?
[internal reasoning]
17 × 24 = 17 × 20 + 17 × 4
        = 340 + 68
        = 408
Final answer: 408
You may not see this reasoning, but it still consumes compute.
Reasoning tokens are real: they cost latency, money, and budget.
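To make the cost of reasoning tokens concrete, here is a minimal sketch of a per-request cost calculation. The prices and token counts are hypothetical assumptions, not real API pricing; the point is that hidden reasoning tokens are typically billed like output tokens.

```python
def request_cost(prompt_tokens, reasoning_tokens, output_tokens,
                 price_in=3.0, price_out=12.0):
    """Cost in dollars. Prices are per million tokens (hypothetical).

    Reasoning tokens are billed as output tokens even when hidden.
    """
    billed_output = reasoning_tokens + output_tokens
    return (prompt_tokens * price_in + billed_output * price_out) / 1_000_000

# The same prompt, with and without deep reasoning:
shallow = request_cost(prompt_tokens=200, reasoning_tokens=0, output_tokens=300)
deep = request_cost(prompt_tokens=200, reasoning_tokens=8_000, output_tokens=300)
print(f"shallow: ${shallow:.4f}, deep: ${deep:.4f}")
```

Even with made-up prices, the shape of the tradeoff is clear: a few thousand reasoning tokens can dominate the bill for a single call.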
The Tradeoff: Depth vs Cost
More reasoning → better answers
But also → slower and more expensive
This leads to a core design lever:
| Reasoning Depth | Speed | Cost | When to Use |
|---|---|---|---|
| Low | Fast | Low | Simple tasks |
| Medium | Moderate | Medium | Multi-step reasoning |
| High | Slow | High | Planning, math, code |
Do not pay for thinking when the task is trivial.
Do pay for it when mistakes are expensive.
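The table above can be treated as a policy. Here is a minimal sketch of that lever as code; the task categories and effort levels are illustrative assumptions, not a fixed taxonomy.

```python
# Map task categories to reasoning depth. Categories are illustrative.
EFFORT_BY_TASK = {
    "extraction": "low",      # simple tasks: fast and cheap
    "formatting": "low",
    "multi_step": "medium",   # moderate reasoning
    "planning": "high",       # mistakes are expensive: pay for thinking
    "math": "high",
    "code_review": "high",
}

def reasoning_effort(task_type: str) -> str:
    # Default to low: do not pay for thinking on trivial tasks.
    return EFFORT_BY_TASK.get(task_type, "low")

print(reasoning_effort("planning"))
print(reasoning_effort("formatting"))
```

In a real system this lookup would feed directly into the model call's reasoning-depth parameter.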
Why This Changes Agent Design
Previously, agents had to externalize reasoning:
observe → reason → plan → act → reflect
Most of the intelligence lived outside the model, in loops.
Now, part of that loop moves inside the model call.
| Phase | Earlier Agents | Reasoning Models |
|---|---|---|
| Reason | External loops | Internalized |
| Plan | Explicit prompting | Often internal |
| Reflect | Separate steps | Sometimes internal |
This leads to a key shift:
fewer outer loops, more powerful inner calls
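The shift can be sketched as two shapes of the same agent. `call_model` below is a placeholder stub, not a real API client; it only illustrates where the loop lives.

```python
def call_model(prompt, effort="low"):
    # Stub standing in for any LLM client call.
    return f"[{effort}] answer to: {prompt}"

# Classic agent: intelligence lives in the outer loop.
def classic_agent(task, max_steps=3):
    plan = call_model(f"Plan: {task}")
    for _ in range(max_steps):
        action = call_model(f"Next action for: {plan}")
        plan = call_model(f"Reflect on: {action}")
    return plan

# Reasoning-model agent: one powerful inner call.
def reasoning_agent(task):
    return call_model(f"Plan, act, and verify: {task}", effort="high")
```

The classic version makes many cheap calls; the reasoning version makes one expensive call that internalizes planning and reflection.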
The Core Design Choice
You now have two ways to build intelligence:
1. Externalize it (classic agents)
- more loops
- more orchestration
- cheaper models
- more control
2. Internalize it (reasoning models)
- fewer calls
- deeper thinking per call
- higher cost per step
- simpler orchestration
Neither is universally better.
Fast models are best for:
- routing
- formatting
- extraction
- simple tool use
Reasoning models are best for:
- planning
- code reasoning
- math and logic
- evaluation and critique
The Practical Rule
Don’t ask:
“Should I use a reasoning model?”
Ask:
“Where is a wrong decision expensive?”
Use reasoning models there.
Examples:
- choosing a multi-step plan
- deciding if evidence is sufficient
- reviewing code before execution
- evaluating outputs
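The practical rule above can be sketched as routing by the estimated cost of a mistake rather than by task label. The threshold and tier names are illustrative assumptions.

```python
def pick_model(error_cost: float, threshold: float = 10.0) -> str:
    """Return a model tier given the estimated dollar cost of a mistake."""
    return "reasoning" if error_cost >= threshold else "fast"

print(pick_model(error_cost=0.5))    # formatting a string
print(pick_model(error_cost=500.0))  # reviewing code before execution
```

In practice the threshold depends on your latency budget and per-call pricing, but the decision rule stays the same.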
The Bigger Picture
The old scaling story was:
- bigger models
- more data
- better training
The new story adds:
- more inference compute
- deeper reasoning
- better performance on hard tasks
Intelligence is no longer only trained; it is also computed at runtime.
```python
# This code sample is not tested, and is provided just as reference
# Only Gemini and Ollama based code on this portal are tested
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1",
    messages=[
        {
            "role": "user",
            "content": "Design a fault-tolerant multi-agent system."
        }
    ],
    reasoning_effort="high",  # increase internal reasoning depth
)

print(response.choices[0].message.content)

# Includes hidden reasoning tokens
print(response.usage)
```

```rust
// This code sample is not tested, and is provided just as reference
// Only Gemini and Ollama based code on this portal are tested
use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    let body = json!({
        "model": "o1",
        "messages": [{
            "role": "user",
            "content": "Design a fault-tolerant multi-agent system."
        }],
        "reasoning_effort": "high"
    });

    let response = client
        .post("https://api.openai.com/v1/chat/completions")
        .header("Authorization", format!("Bearer {}", std::env::var("OPENAI_API_KEY")?))
        .header("Content-Type", "application/json")
        .json(&body)
        .send()
        .await?
        .json::<serde_json::Value>()
        .await?;

    let msg = &response["choices"][0]["message"]["content"];
    println!("=== Answer ===\n{}", msg);

    Ok(())
}
```
Closing
Inference-time compute doesn’t replace agents.
It changes them.
Some reasoning moves inside the model. The rest still lives in your system.
The real skill now is deciding where each piece belongs.
→ Next up: LLM Basics and Primitives for Agents