
LLM Basics and Primitives for Agents

Most agents don’t fail because the model is weak. They fail because the builder doesn’t understand how the model actually works.

To build reliable agents, you need more than “LLMs generate text.” You need a mental model of how they represent information, handle context, and expose control surfaces.

In the previous article, we saw how reasoning models shift intelligence into inference-time compute.

This article focuses on the other side of that equation: the primitives you use to control, constrain, and integrate that intelligence into real systems.


Why Agent Builders Need an LLM Mental Model

An LLM is the reasoning and language engine inside many modern agents. It helps the system:

- interpret user requests
- reason over the current context
- decide which tools or actions to use
- generate responses and plans

But the model is not magic. It has hard limits, strange failure modes, and specific interfaces. If you do not understand those, you end up building brittle agents that feel smart in demos and collapse in production.

A good agent builder needs to understand two layers:

  1. Conceptual layer
    How the model sees text, how it handles context, and how prompting shapes behavior.

  2. Operational layer
    The specific primitives you use to constrain outputs, wire tools into the loop, and manage memory.

We will cover both.


Tokenization: How LLMs Read Text

Computers do not read words directly. Before an LLM can process text, that text must be broken into smaller pieces called tokens.

A token is not always a whole word. Depending on the tokenizer, a token may be:

- a whole word
- part of a word (a subword unit)
- a single character
- punctuation or whitespace

For example, a sentence like:

Hello, world!

might be split into tokens like:

["Hello", ",", " world", "!"]

And a more complex word might be split into subword units rather than one whole token.

This matters because tokens are the real unit of computation for LLMs.

Everything is measured in tokens:

- context window limits
- API pricing
- generation speed, since output is produced one token at a time

For agents, this matters even more because agents often work with:

- long conversation histories
- verbose tool outputs
- retrieved documents
- intermediate plans and state

A badly designed agent can waste huge amounts of tokens on irrelevant context.
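To make tokenization concrete, here is a toy tokenizer that reproduces the split shown above. This regex is purely illustrative; real tokenizers (BPE and similar) learn subword merges from data and split text quite differently.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    # Toy rule: a word (optionally with its leading space) or a single
    # non-word character. Real tokenizers use learned subword vocabularies.
    return re.findall(r" ?\w+|[^\w ]", text)

tokens = toy_tokenize("Hello, world!")
print(tokens)  # ['Hello', ',', ' world', '!']
```

Note that the space attaches to `" world"`: like many real tokenizers, even this toy treats whitespace as part of a token rather than discarding it.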


Embeddings: How LLM Systems Represent Meaning

Once text is tokenized, it must be represented numerically. One important representation used throughout LLM systems is the embedding.

An embedding is a dense vector that captures semantic information. You can think of it as a point in a high-dimensional space where semantically similar things lie closer together.

For example, embeddings for:

- "How do I reset my password?"
- "Steps for password recovery"

will tend to be closer than unrelated phrases.

Embeddings are especially important in agentic systems because they power capabilities such as:

- semantic search over documents
- retrieval-augmented generation (RAG)
- memory lookup across sessions
- deduplication and clustering of information

So while the LLM itself generates language token by token, the broader agent system often uses embeddings to decide what information should be brought into the model’s context in the first place.
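The "closer together" idea is usually measured with cosine similarity. A minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions and come from an embedding model):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- invented numbers, for illustration only.
embeddings = {
    "dog":         [0.90, 0.80, 0.10],
    "puppy":       [0.85, 0.75, 0.20],
    "spreadsheet": [0.10, 0.20, 0.90],
}

print(cosine_similarity(embeddings["dog"], embeddings["puppy"]))        # high, ~0.99
print(cosine_similarity(embeddings["dog"], embeddings["spreadsheet"]))  # low, ~0.30
```

Retrieval systems rank candidate documents by exactly this kind of similarity between the query embedding and stored document embeddings.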

That is one of the key mental models for agent design: the model reasons over whatever is in its context, and the surrounding system decides what that context contains.


Context: Why LLMs Need Help Remembering

An LLM does not have memory in the human sense. It does not continuously remember your conversation unless the relevant information is included in the current input.

This leads to one of the most important ideas in agent engineering:

Context Window

The context window is the maximum number of tokens the model can process in one call, including both input and output.

That limit is a hard technical bound. No matter how smart the model is, it cannot directly reason over text that is not inside the current context for that call.

If the important information is absent, the model cannot use it.

Context Management

Context management is the engineering problem of deciding what should go into that limited window.

This is where agent design becomes real engineering rather than just prompting.

You must decide:

- what to include: instructions, history, retrieved documents, tool results
- what to summarize or compress
- what to drop entirely
- what to store externally for later retrieval

These are separate concepts:

|                 | Context Window       | Context Management          |
| --------------- | -------------------- | --------------------------- |
| What it is      | Hard model limit     | Your design strategy        |
| Who controls it | Model provider       | You                         |
| Nature          | Technical constraint | System architecture problem |
| Question        | "How much fits?"     | "What is worth putting in?" |

A lot of beginners confuse these. Bigger context windows help, but they do not solve the core design problem.
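A context-management policy can be as simple as a token budget. One minimal sketch (the 4-characters-per-token ratio and the keep-newest trimming policy are assumptions for illustration):

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # A real system should count with the model's actual tokenizer.
    return max(1, len(text) // 4)

def fit_to_budget(system_prompt: str, history: list[str], budget: int) -> list[str]:
    """Always keep the system prompt, then as many recent messages as fit."""
    kept: list[str] = []
    used = approx_tokens(system_prompt)
    for message in reversed(history):  # walk newest -> oldest
        cost = approx_tokens(message)
        if used + cost > budget:
            break  # older messages are dropped (or could be summarized)
        kept.append(message)
        used += cost
    return [system_prompt] + list(reversed(kept))

history = [f"message {i} " + "x" * 30 for i in range(5)]  # ~10 tokens each
window = fit_to_budget("You are helpful.", history, budget=30)
print(len(window))  # 3: the system prompt plus the two most recent messages
```

The context window is the fixed `budget`; context management is everything else in this function, and it is entirely your design decision.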


Bigger Context Is Not a Free Win

Modern models can support very large context windows. That sounds amazing, but in practice, “just stuff everything into the prompt” is often the wrong strategy.

Why?

Attention dilution

When context gets very large, important details can get lost among irrelevant ones. The model may ignore or misweight the information that actually matters.

Cost explosion

Tokens cost money. More context means more spend, often dramatically more.

Latency

Large prompts take longer to process. Multi-step agents already have loop overhead, so oversized context can make them painfully slow.

Noise accumulation

If you keep dumping tool outputs, stale plans, irrelevant history, and half-useful retrieval chunks into the prompt, the model gets a noisier picture of the task.

That is why good agents do retrieval and selection, not blind stuffing.


Working Memory vs Long Context

This is one of the most useful mental models for agent builders.

|           | Working Memory                   | Long Context                          |
| --------- | -------------------------------- | ------------------------------------- |
| Analogy   | RAM                              | External storage                      |
| In agents | What is in the prompt right now  | Information stored outside the prompt |
| Speed     | Instantly available to the model | Requires retrieval                    |
| Limit     | Hard token cap                   | Effectively much larger               |
| Best for  | Current reasoning state          | Background knowledge and history      |

Working memory

Working memory is the active prompt. It contains things like:

- the system prompt
- the current user request
- recent tool results
- the current plan or state

This is the information the model can directly reason over right now.

Long context

Long context refers to information outside the current prompt, such as:

- past sessions
- documentation and knowledge bases
- user preferences
- archived observations

That information is useful, but it is not active until you retrieve the relevant parts and place them into working memory.

┌─────────────────────────────────────────────────┐
│ WORKING MEMORY (active context window)          │
│ • System prompt                                 │
│ • Current user request                          │
│ • Recent tool results                           │
│ • Current plan/state                            │
│ • Retrieved snippet for this step               │
└─────────────────────────────────────────────────┘
                        ▲
                        │ retrieve relevant information
┌─────────────────────────────────────────────────┐
│ LONG CONTEXT (external memory)                  │
│ • Past sessions                                 │
│ • Documentation                                 │
│ • Knowledge base                                │
│ • User preferences                              │
│ • Archived observations                         │
└─────────────────────────────────────────────────┘

The job of the agent system is not merely to “have memory.” It is to retrieve the right memory at the right step.
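Retrieval from long context into working memory can be sketched with a deliberately naive relevance score. Real systems use embedding similarity rather than word overlap, and the knowledge-base entries here are invented:

```python
import re

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query: str, doc: str) -> int:
    # Naive relevance: count shared words. A real system would compare
    # embedding vectors with cosine similarity instead.
    return len(words(query) & words(doc))

def retrieve(query: str, long_context: list[str], k: int = 2) -> list[str]:
    """Pull the k most relevant snippets into working memory."""
    return sorted(long_context, key=lambda d: score(query, d), reverse=True)[:k]

knowledge_base = [
    "User prefers responses in French.",
    "The deploy script lives in scripts/deploy.sh.",
    "Last session ended while debugging a failing deploy.",
]

prompt_snippets = retrieve("why did the deploy fail", knowledge_base)
print(prompt_snippets)
```

Only the two deploy-related snippets would be placed into the prompt for this step; the preference note stays in external memory until a query makes it relevant.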


Prompt Engineering: How You Steer the Model

If context is what the model sees, prompting is how you steer what it does with what it sees.

Prompt engineering is the practice of designing model inputs so that the model produces more useful, reliable outputs.

At a conceptual level, prompting helps with:

- specifying the task precisely
- constraining the output format
- setting tone, role, and assumptions
- demonstrating expected behavior through examples

Classic prompting patterns include:

Zero-shot prompting

You tell the model what to do without examples.

Example:

Classify this email as spam or not spam.

Few-shot prompting

You provide a few examples to teach the pattern.

This is useful when the task is subtle and examples communicate the expected behavior more clearly than plain instructions.
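Few-shot examples are often expressed as prior conversation turns in the messages list. A sketch (the task, labels, and examples are made up for illustration):

```python
# Few-shot classification expressed as fabricated prior turns.
# The model infers the pattern from the example pairs and applies it
# to the final, unanswered user message.
messages = [
    {"role": "system", "content": "Classify each support ticket as 'billing', 'bug', or 'other'. Reply with the label only."},
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The export button crashes the app."},
    {"role": "assistant", "content": "bug"},
    {"role": "user", "content": "Do you offer student discounts?"},
    # The model's next reply should follow the single-label pattern.
]
print(len(messages))
```

This messages list would be passed to any chat API unchanged; the examples cost tokens on every call, which is the usual trade-off of few-shot prompting.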

Chain-of-thought style prompting

You encourage the model to reason through a problem step by step.

This can improve performance on tasks that require decomposition, though in production systems you often want the benefits of reasoning without exposing unnecessary internal reasoning in user-facing outputs.

Role prompting

You assign the model a role such as researcher, tutor, analyst, or coding assistant.

This can help stabilize tone, behavior, and assumptions.

For agents, prompting is not just about getting a nicer answer. It is often how you define the behavior of the entire agent loop.


System Prompting in Production: The PAG Framework

In real agents, the most important prompt is usually the system prompt.

A production system prompt is not just a vibe-setting sentence like “you are a helpful assistant.” It is closer to an operational contract.

A useful framework here is:

PAG = Persona + Action + Guardrail

| Dimension | What It Defines                 | Example                                                     |
| --------- | ------------------------------- | ----------------------------------------------------------- |
| Persona   | Who the agent is                | "You are a senior Rust systems engineer."                   |
| Action    | What the agent is supposed to do | "You help debug Rust code and may use documentation tools." |
| Guardrail | What it must not do             | "Do not fabricate APIs. If unsure, say so."                 |

This structure is simple, but it captures the three things that matter most: who the agent is, what it is expected to do, and what it must never do.

from ollama import chat
from ollama import ChatResponse

# -----------------------------
# Ollama details
# -----------------------------
MODEL = "qwen3.5:9b"

# -----------------------------
# System Persona & Guardrails
# -----------------------------
SYSTEM_PROMPT = """
## Persona
You are Ferris, an expert AI assistant specializing in Rust programming and
systems engineering. You are precise, concise, and prefer working examples
over abstract explanations.

## Action
You help developers with:
- Writing and debugging Rust code
- Explaining Rust concepts
- Recommending crates from the Rust ecosystem
You may search documentation and run code examples to verify answers.

## Guardrail
- Do not answer questions outside Rust and systems programming.
- If unsure, say so explicitly.
- Never fabricate crate names or API signatures.
- Do not execute code that modifies the filesystem.
"""

def run_rust_expert(user_query: str) -> str:
    response: ChatResponse = chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
        options={"temperature": 0.2},
        stream=False,
    )
    return response["message"]["content"]

if __name__ == "__main__":
    query = "Please explain string slices in Rust?"
    print("Querying Ferris...\n")
    answer = run_rust_expert(query)
    print("-" * 30)
    print("Ferris's Response:")
    print(answer)

Structured Outputs: Making LLM Responses Reliable

One of the fastest ways to break an agent is to rely on unstructured text where your application expects machine-readable state.

A lot of people start with prompts like:

Return the answer as JSON.

That works sometimes. Then it breaks at the worst possible time.

For agents, structured outputs matter because the model often needs to return:

- tool call arguments
- state updates and routing decisions
- flags and confidence scores that downstream code branches on

This is why schema-enforced structured output is such a foundational primitive.

Instead of hoping the model returns the right JSON shape, you define the shape explicitly and enforce it.

from ollama import chat
from pydantic import BaseModel

MODEL = "qwen3.5:9b"

# -----------------------------
# Schema
# -----------------------------
class ResearchResult(BaseModel):
    topic: str
    summary: str
    sources: list[str]
    confidence: float
    needs_more_research: bool

# -----------------------------
# Run query
# -----------------------------
response = chat(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a research assistant."},
        {"role": "user", "content": "Research the current state of MCP (Model Context Protocol)."},
    ],
    format=ResearchResult.model_json_schema(),
    options={"temperature": 0.2},
)

# -----------------------------
# Parse response
# -----------------------------
result = ResearchResult.model_validate_json(response.message.content)
print(f"Topic: {result.topic}")
print(f"Summary: {result.summary}")
print(f"Confidence: {result.confidence:.0%}")
print(f"Needs more research: {result.needs_more_research}")
print(f"Sources: {result.sources}")

Function Calling: How Agents Use Tools

An agent becomes much more useful when it can use tools.

But the model does not literally execute your code. Instead, it produces a structured request that says, in effect: "call this tool, with these arguments."

Then your application executes the tool and sends the result back into the conversation.

That mechanism is called function calling or tool calling.

This is one of the core primitives of modern agent systems because it lets the model move beyond text generation into interaction with the outside world.

Why tool calling matters

Tool calling lets the model:

- fetch live information
- query databases and APIs
- run computations or code
- take actions in external systems

Without tool use, the model is limited to what is already in its parameters and prompt.

With tool use, it becomes part of a larger decision-and-action system.


Tool Schemas: Teaching the Model What Tools Exist

A tool is only as usable as its schema.

The model needs to know:

- the tool's name
- what it does
- what parameters it accepts, with their types
- when it should (and should not) be used

That information is defined in the tool schema.

We have already seen tool calling in our previous examples. Let’s take a look at them again.

Python Example
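As one illustrative sketch, here is a tool schema in the JSON style used by OpenAI-compatible and Ollama tool calling, plus a local dispatcher. The tool name, parameters, and weather result are all invented for this example:

```python
# Hypothetical tool schema -- the name, description, and parameters
# are made up; the JSON shape follows the common function-calling style.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

def dispatch(tool_call: dict) -> str:
    """Execute a model-issued tool call locally (stub implementation)."""
    name = tool_call["name"]
    args = tool_call["arguments"]
    if name == "get_weather":
        return f"Weather in {args['city']}: 18°C, cloudy"  # stubbed result
    raise ValueError(f"Unknown tool: {name}")

# A model response would contain a call shaped roughly like this:
fake_call = {"name": "get_weather", "arguments": {"city": "Berlin"}}
print(dispatch(fake_call))
```

The schema is what the model sees; `dispatch` is your application's side of the contract. The tool result is then appended back into the conversation for the next model call.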

Rust Example


Prompting, Tools, and Context Work Together

By this point, the three-way relationship should be clear: prompts define behavior, tools define capabilities, and context defines what the model knows at each step.

These are not separate topics in production. They are tightly coupled.

A badly scoped prompt can cause bad tool use. A weak tool schema can cause wrong arguments. A noisy context can cause confused reasoning. A missing memory snippet can make the model act like it “forgot” something important.

That is why agent design is mostly about getting the interfaces between these pieces right.


The Practical Capability Stack for LLM-Based Agents

At this point, you can see that LLMs are not standalone systems. They are components inside a larger control loop.

1. Text enters the system
→ tokenized into tokens
2. Relevant knowledge is found
→ often through embeddings and retrieval
3. A prompt is built
→ system instructions + user input + state + retrieved context
4. The model responds
→ either with text, structured output, or a tool call
5. Your application executes the result
→ parse schema, run tool, update state, manage memory
6. The loop continues
→ until the task is complete

This is the real bridge between “LLM basics” and “agent primitives.” One is the mental model; the other is the implementation surface.


Wrapping Up

To build reliable agents, you need more than a vague idea of how LLMs work.

You need to understand:

- how tokenization and context windows constrain what the model can see
- how embeddings and retrieval select what it should see
- how prompting steers its behavior
- how structured outputs and tool calling connect it to real systems

That is the foundation of modern LLM-based agent engineering.

LLMs are not magic.

They are constrained systems with:

- token-based interfaces
- finite context windows
- probabilistic outputs
- no built-in persistent memory

The difference between a demo and a production agent is how well you design around those constraints.

The next article goes one step deeper into how these primitives combine inside a full agent runtime: planner, executor, memory, state, and feedback loop.

Next up: Internal Agent Architecture