
The Inference-Time Compute Revolution

In early large language models, improvement followed a simple rule:

train bigger models on more data with more compute

Intelligence was scaled almost entirely at training time.

That still matters—but modern reasoning models introduce a second axis:

scaling at inference time

This shift fundamentally changes how we design systems, especially agents.


From Instant Answers to Deliberate Reasoning

Classic LLMs behave like fast predictors:

prompt → answer

They generate the most likely continuation immediately.

This is powerful—but flawed. The model can:

- lock in an early mistake and never revisit it
- skip intermediate steps in multi-step problems
- sound confident while being wrong

Reasoning models change the flow:

prompt → internal reasoning → answer

Instead of answering immediately, they:

- generate intermediate reasoning tokens
- work through the problem step by step
- only then produce the final answer

The key difference is simple:

they spend compute before answering


What “Thinking Before Speaking” Really Means

“Thinking before speaking” is a useful phrase—but slightly misleading.

These models are still neural networks—not symbolic reasoners.

What changes is:

they are allowed to use more tokens for intermediate reasoning before producing the final answer

Different systems expose this differently: some show the reasoning as a visible chain of thought, some hide the reasoning tokens entirely, and some let you tune a reasoning-effort or thinking-budget parameter.

The interface varies. The principle does not.


A Useful Mental Model

Think of the model as writing a scratchpad internally:

User: What is 17 × 24?
[internal reasoning]
17 × 24 = 17 × 20 + 17 × 4
= 340 + 68
= 408
Final answer: 408
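
The scratchpad arithmetic above can be checked directly:

```python
# Verify the model's decomposition of 17 × 24 into partial products
partial_sums = 17 * 20 + 17 * 4  # 340 + 68
assert partial_sums == 17 * 24
print(partial_sums)  # 408
```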

You may not see this reasoning—but it still consumes compute.

Reasoning tokens are real: they cost latency, money, and budget.
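
To make that concrete, here is a minimal cost sketch. The function and the per-1K-token prices are made-up placeholders for illustration, not any provider's real rates or API:

```python
# Sketch: estimating the cost of a reasoning-model call.
# Reasoning tokens are typically billed like output tokens,
# even though you never see them in the final answer.

def estimate_cost(prompt_tokens, reasoning_tokens, output_tokens,
                  price_in_per_1k=0.015, price_out_per_1k=0.060):
    """Total cost in dollars, with hypothetical placeholder prices."""
    input_cost = (prompt_tokens / 1000) * price_in_per_1k
    output_cost = ((reasoning_tokens + output_tokens) / 1000) * price_out_per_1k
    return input_cost + output_cost

# The same visible answer costs far more once hidden reasoning is counted:
visible_only = estimate_cost(200, 0, 300)
with_reasoning = estimate_cost(200, 4000, 300)
print(f"visible only:   ${visible_only:.4f}")
print(f"with reasoning: ${with_reasoning:.4f}")
```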


The Tradeoff: Depth vs Cost

More reasoning → better answers.
But also → slower and more expensive.

This leads to a core design lever:

| Reasoning Depth | Speed    | Cost   | When to Use          |
| --------------- | -------- | ------ | -------------------- |
| Low             | Fast     | Low    | Simple tasks         |
| Medium          | Moderate | Medium | Multi-step reasoning |
| High            | Slow     | High   | Planning, math, code |

Do not pay for thinking when the task is trivial.

Do pay for it when mistakes are expensive.
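
One way to apply this lever is a small dispatcher that maps a rough task classification to a reasoning depth. The task labels and the mapping below are illustrative assumptions, not part of any API:

```python
# Sketch: pick a reasoning depth based on task type, so you only
# pay for thinking where mistakes are expensive.

def pick_reasoning_depth(task_type: str) -> str:
    depth = {
        "lookup": "low",         # simple tasks: don't pay for thinking
        "formatting": "low",
        "multi_step": "medium",  # multi-step reasoning
        "planning": "high",      # planning, math, code: pay for depth
        "math": "high",
        "code": "high",
    }
    return depth.get(task_type, "medium")  # default to moderate depth

print(pick_reasoning_depth("lookup"))    # low
print(pick_reasoning_depth("planning"))  # high
```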


Why This Changes Agent Design

Previously, agents had to externalize reasoning:

observe → reason → plan → act → reflect

Most of the intelligence lived outside the model, in loops.


Now, part of that loop moves inside the model call.

| Phase   | Earlier Agents     | Reasoning Models   |
| ------- | ------------------ | ------------------ |
| Reason  | External loops     | Internalized       |
| Plan    | Explicit prompting | Often internal     |
| Reflect | Separate steps     | Sometimes internal |

This leads to a key shift:

fewer outer loops, more powerful inner calls
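
The shift can be sketched with a stub in place of the model: the classic agent spends many cheap calls driving an external loop, while the reasoning model folds that loop into a single powerful call. Both functions are illustrative, not a real framework:

```python
# Sketch: external reasoning loop vs one internalized reasoning call.

def classic_agent(task, model, max_steps=3):
    """Earlier agents: reason/plan/act externalized into an explicit loop."""
    state = task
    for _ in range(max_steps):
        thought = model(f"Reason about: {state}")
        plan = model(f"Plan the next step for: {thought}")
        state = model(f"Act on: {plan}")
    return state  # many cheap calls; intelligence lives in the loop

def reasoning_agent(task, model):
    """Reasoning models: one call that reasons internally before answering."""
    return model(f"Solve this, reasoning internally first: {task}")

calls = []
def stub(prompt):          # stand-in for a real LLM call
    calls.append(prompt)
    return prompt

classic_agent("deploy a service", stub)
print("classic agent model calls:", len(calls))   # 9 (3 per loop iteration)
calls.clear()
reasoning_agent("deploy a service", stub)
print("reasoning model calls:", len(calls))       # 1
```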


The Core Design Choice

You now have two ways to build intelligence:

1. Externalize it (classic agents)

2. Internalize it (reasoning models)


Neither is universally better.

Fast models are best for:

- simple, high-volume tasks
- formatting, extraction, and routing
- cases where a wrong answer is cheap to catch

Reasoning models are best for:

- planning and multi-step decisions
- math, code, and debugging
- cases where a wrong answer is expensive


The Practical Rule

Don’t ask:

“Should I use a reasoning model?”

Ask:

“Where is a wrong decision expensive?”

Use reasoning models there.

Examples:

- planning an agent's next action
- generating or reviewing code
- any step that triggers an irreversible action

The Bigger Picture

The old scaling story was:

- train bigger models on more data with more compute

The new story adds:

- spend more compute at inference time


Intelligence is no longer only trained—it is also computed at runtime.

# This code sample is not tested, and is provided just as reference
# Only Gemini and Ollama based code on this portal are tested
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1",
    messages=[
        {
            "role": "user",
            "content": "Design a fault-tolerant multi-agent system.",
        }
    ],
    reasoning_effort="high",  # 🔥 Increase internal reasoning depth
)

print(response.choices[0].message.content)

# Usage includes hidden reasoning tokens
print(response.usage)

Closing

Inference-time compute doesn’t replace agents.

It changes them.

Some reasoning moves inside the model. The rest still lives in your system.

The real skill now is deciding where each piece belongs.

Next up: LLM Basics and Primitives for Agents