LLM Basics and Primitives for Agents
Most agents don’t fail because the model is weak. They fail because the builder doesn’t understand how the model actually works.
To build reliable agents, you need more than “LLMs generate text.” You need a mental model of how they represent information, handle context, and expose control surfaces.
In the previous article, we saw how reasoning models shift intelligence into inference-time compute.
This article focuses on the other side of that equation: the primitives you use to control, constrain, and integrate that intelligence into real systems.
Why Agent Builders Need an LLM Mental Model
An LLM is the reasoning and language engine inside many modern agents. It helps the system:
- interpret user intent
- reason about the next step
- decide whether to use a tool
- transform raw tool outputs into useful conclusions
- maintain coherent multi-step behavior
But the model is not magic. It has hard limits, strange failure modes, and specific interfaces. If you do not understand those, you end up building brittle agents that feel smart in demos and collapse in production.
A good agent builder needs to understand two layers:
1. Conceptual layer: how the model sees text, how it handles context, and how prompting shapes behavior.
2. Operational layer: the specific primitives you use to constrain outputs, wire tools into the loop, and manage memory.
We will cover both.
Tokenization: How LLMs Read Text
Computers do not read words directly. Before an LLM can process text, that text must be broken into smaller pieces called tokens.
A token is not always a whole word. Depending on the tokenizer, a token may be:
- a full word
- part of a word
- punctuation
- whitespace-prefixed text
- sometimes even a short character sequence
For example, a sentence like:

Hello, world!

might be split into tokens like:

["Hello", ",", " world", "!"]

And a more complex word might be split into subword units rather than one whole token.
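As a rough illustration, you can approximate this splitting behavior in a few lines. This is a toy splitter, not a real BPE tokenizer; production tokenizers learn their vocabulary from data and split differently.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    # Toy approximation: split into words and punctuation, keeping the
    # leading space attached to words the way many real tokenizers do
    # ("Hello", ",", " world", "!"). Real BPE tokenizers are learned.
    return re.findall(r" ?\w+|[^\w\s]", text)

tokens = toy_tokenize("Hello, world!")
print(tokens)  # ['Hello', ',', ' world', '!']
```

Even this crude version shows why token counts diverge from word counts: punctuation and whitespace handling produce more (or fewer) tokens than you might guess.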
This matters because tokens are the real unit of computation for LLMs.
Everything is measured in tokens:
- input size
- output size
- context window usage
- latency
- cost
For agents, this matters even more because agents often work with:
- long user conversations
- tool outputs
- documents
- logs
- retrieved chunks from memory systems
- structured state
A badly designed agent can waste huge amounts of tokens on irrelevant context.
Embeddings: How LLM Systems Represent Meaning
Once text is tokenized, it must be represented numerically. One important representation used throughout LLM systems is the embedding.
An embedding is a dense vector that captures semantic information. You can think of it as a point in a high-dimensional space where semantically similar things lie closer together.
For example, embeddings for:
- “doctor” and “physician”
- “car” and “vehicle”
- “Rust async programming” and “Tokio tutorial”
will tend to be closer than unrelated phrases.
Embeddings are especially important in agentic systems because they power capabilities such as:
- semantic search
- retrieval-augmented generation
- memory lookup
- document matching
- intent clustering
- relevance ranking
So while the LLM itself generates language token by token, the broader agent system often uses embeddings to decide what information should be brought into the model’s context in the first place.
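Similarity between embeddings is usually measured with cosine similarity. Here is a minimal sketch using made-up 4-dimensional vectors; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a . b) / (|a| * |b|): 1.0 means identical direction,
    # values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, hand-made for illustration only.
doctor    = [0.90, 0.10, 0.00, 0.20]
physician = [0.85, 0.15, 0.05, 0.25]
car       = [0.00, 0.80, 0.60, 0.10]

print(cosine_similarity(doctor, physician) > cosine_similarity(doctor, car))  # True
```

Retrieval systems run exactly this comparison (at scale, with approximate nearest-neighbor indexes) to decide which stored chunks are relevant to the current query.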
That is one of the key mental models for agent design: the model reasons only over what is in its context, while the surrounding system is responsible for selecting what goes there.
Context: Why LLMs Need Help Remembering
An LLM does not have memory in the human sense. It does not continuously remember your conversation unless the relevant information is included in the current input.
This leads to one of the most important ideas in agent engineering:
Context Window
The context window is the maximum number of tokens the model can process in one call, including both input and output.
That limit is a hard technical bound. No matter how smart the model is, it cannot directly reason over text that is not inside the current context for that call.
If the important information is absent, the model cannot use it.
Context Management
Context management is the engineering problem of deciding what should go into that limited window.
This is where agent design becomes real engineering rather than just prompting.
You must decide:
- what past conversation to include
- what tool results to keep
- what retrieved documents matter
- what state must be explicitly represented
- what old information should be removed or summarized
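One common selection strategy is a token budget: always keep the system prompt, then add messages from newest to oldest until the budget runs out. Below is a minimal sketch, assuming a crude length-based token estimate; a real system would use the model's actual tokenizer and likely summarize dropped history rather than discard it.

```python
def estimate_tokens(text: str) -> int:
    # Crude proxy (~4 characters per token); real systems should use
    # the model's own tokenizer for accurate counts.
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system prompt plus the most recent messages that fit."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    # Walk backwards from the newest message, adding while budget allows.
    for m in reversed(rest):
        cost = estimate_tokens(m["content"])
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))

history = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "x" * 400},  # old, oversized message
    {"role": "user", "content": "latest question"},
]
trimmed = trim_to_budget(history, budget=50)
print([m["role"] for m in trimmed])  # ['system', 'user']
```

Note the design choice: recency wins. More sophisticated strategies score messages by relevance to the current step instead of just age.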
These are separate concepts:
| | Context Window | Context Management |
|---|---|---|
| What it is | Hard model limit | Your design strategy |
| Who controls it | Model provider | You |
| Nature | Technical constraint | System architecture problem |
| Question | “How much fits?” | “What is worth putting in?” |
A lot of beginners confuse these. Bigger context windows help, but they do not solve the core design problem.
Bigger Context Is Not a Free Win
Modern models can support very large context windows. That sounds amazing, but in practice, “just stuff everything into the prompt” is often the wrong strategy.
Why?
Attention dilution
When context gets very large, important details can get lost among irrelevant ones. The model may ignore or misweight the information that actually matters.
Cost explosion
Tokens cost money. More context means more spend, often dramatically more.
Latency
Large prompts take longer to process. Multi-step agents already have loop overhead, so oversized context can make them painfully slow.
Noise accumulation
If you keep dumping tool outputs, stale plans, irrelevant history, and half-useful retrieval chunks into the prompt, the model gets a noisier picture of the task.
That is why good agents do retrieval and selection, not blind stuffing.
Working Memory vs Long Context
This is one of the most useful mental models for agent builders.
| | Working Memory | Long Context |
|---|---|---|
| Analogy | RAM | External storage |
| In agents | What is in the prompt right now | Information stored outside the prompt |
| Speed | Instantly available to the model | Requires retrieval |
| Limit | Hard token cap | Effectively much larger |
| Best for | Current reasoning state | Background knowledge and history |
Working memory
Working memory is the active prompt. It contains things like:
- system prompt
- current user request
- latest tool outputs
- current plan
- selected memory snippets
- structured task state
This is the information the model can directly reason over right now.
Long context
Long context refers to information outside the current prompt, such as:
- vector database contents
- documentation store
- past session history
- knowledge base
- user preferences
- archived task traces
That information is useful, but it is not active until you retrieve the relevant parts and place them into working memory.
```
┌──────────────────────────────────────────┐
│ WORKING MEMORY (active context window)   │
│ • System prompt                          │
│ • Current user request                   │
│ • Recent tool results                    │
│ • Current plan/state                     │
│ • Retrieved snippet for this step        │
└──────────────────────────────────────────┘
                     ▲
                     │ retrieve relevant
                     │ information
┌──────────────────────────────────────────┐
│ LONG CONTEXT (external memory)           │
│ • Past sessions                          │
│ • Documentation                          │
│ • Knowledge base                         │
│ • User preferences                       │
│ • Archived observations                  │
└──────────────────────────────────────────┘
```

The job of the agent system is not merely to “have memory.” It is to retrieve the right memory at the right step.
Prompt Engineering: How You Steer the Model
If context determines what the model sees, prompting determines how it behaves.
Prompt engineering is the practice of designing model inputs so that the model produces more useful, reliable outputs.
At a conceptual level, prompting helps with:
- shaping behavior
- constraining task scope
- providing instructions
- encouraging reasoning style
- specifying format
- defining success criteria
Classic prompting patterns include:
Zero-shot prompting
You tell the model what to do without examples.
Example:
Classify this email as spam or not spam.

Few-shot prompting
You provide a few examples to teach the pattern.
This is useful when the task is subtle and examples communicate the expected behavior more clearly than plain instructions.
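As a sketch, a few-shot prompt is often expressed as alternating chat messages, where each user/assistant pair demonstrates the expected input-to-label pattern. The examples and labels below are invented purely for illustration.

```python
# A few-shot classification prompt expressed as chat messages.
# Each example pair teaches the model the expected pattern.
few_shot_messages = [
    {"role": "system", "content": "Classify each email as 'spam' or 'not spam'. Answer with the label only."},
    {"role": "user", "content": "You have WON a FREE cruise! Click now!"},
    {"role": "assistant", "content": "spam"},
    {"role": "user", "content": "Hi, attaching the meeting notes from Tuesday."},
    {"role": "assistant", "content": "not spam"},
    # The real query goes last; the model continues the demonstrated pattern.
    {"role": "user", "content": "Limited-time offer: act before midnight!!!"},
]
```

Because the examples occupy context tokens, few-shot prompting trades context budget for behavioral precision; use just enough examples to pin down the pattern.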
Chain-of-thought style prompting
You encourage the model to reason through a problem step by step.
This can improve performance on tasks that require decomposition, though in production systems you often want the benefits of reasoning without exposing unnecessary internal reasoning in user-facing outputs.
Role prompting
You assign the model a role such as researcher, tutor, analyst, or coding assistant.
This can help stabilize tone, behavior, and assumptions.
For agents, prompting is not just about getting a nicer answer. It is often how you define the behavior of the entire agent loop.
System Prompting in Production: The PAG Framework
In real agents, the most important prompt is usually the system prompt.
A production system prompt is not just a vibe-setting sentence like “you are a helpful assistant.” It is closer to an operational contract.
A useful framework here is:
PAG = Persona + Action + Guardrail
| Dimension | What It Defines | Example |
|---|---|---|
| Persona | Who the agent is | “You are a senior Rust systems engineer.” |
| Action | What the agent is supposed to do | “You help debug Rust code and may use documentation tools.” |
| Guardrail | What it must not do | “Do not fabricate APIs. If unsure, say so.” |
This structure is simple, but it captures the three things that matter most:
- identity
- permissions
- boundaries
Here is that structure as a runnable example against a local Ollama model:

```python
from ollama import chat
from ollama import ChatResponse

# -----------------------------
# Ollama details
# -----------------------------
MODEL = "qwen3.5:9b"

# -----------------------------
# System Persona & Guardrails
# -----------------------------
SYSTEM_PROMPT = """## Persona
You are Ferris, an expert AI assistant specializing in Rust programming and
systems engineering. You are precise, concise, and prefer working examples
over abstract explanations.

## Action
You help developers with:
- Writing and debugging Rust code
- Explaining Rust concepts
- Recommending crates from the Rust ecosystem
You may search documentation and run code examples to verify answers.

## Guardrail
- Do not answer questions outside Rust and systems programming.
- If unsure, say so explicitly.
- Never fabricate crate names or API signatures.
- Do not execute code that modifies the filesystem."""


def run_rust_expert(user_query: str):
    response: ChatResponse = chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
        options={"temperature": 0.2},
        stream=False,
    )
    return response["message"]["content"]


if __name__ == "__main__":
    query = "Please explain string slices in Rust?"

    print("Querying Ferris...\n")
    answer = run_rust_expert(query)

    print("-" * 30)
    print("Ferris's Response:")
    print(answer)
```

The same pattern in Rust, using the ollama-rs crate:

```rust
use ollama_rs::{
    generation::chat::{request::ChatMessageRequest, ChatMessage},
    Ollama,
};

// -----------------------------
// System Persona & Guardrails
// -----------------------------
const SYSTEM_PROMPT: &str = r#"## Persona
You are Ferris, an expert AI assistant specializing in Rust programming and
systems engineering. You are precise, concise, and prefer working examples
over abstract explanations.

## Action
You help developers with:
- Writing and debugging Rust code
- Explaining Rust concepts
- Recommending crates from the Rust ecosystem

## Guardrail
- Do not answer questions outside Rust and systems programming.
- If unsure, say so explicitly.
- Never fabricate crate names or API signatures.
- Do not execute code that modifies the filesystem."#;

pub async fn run() -> Result<(), Box<dyn std::error::Error>> {
    let ollama = Ollama::default();

    let prompt = "What's the best crate for async HTTP in Rust?";
    println!("User: {}\n", prompt);

    // Build the message sequence: system prompt first, then the user turn
    let messages = vec![
        ChatMessage::system(SYSTEM_PROMPT.to_string()),
        ChatMessage::user(prompt.to_string()),
    ];

    // Build the request object for the chosen model
    let request = ChatMessageRequest::new("qwen3.5:9b".to_string(), messages);

    let response = ollama.send_chat_messages(request).await?;

    println!("Ferris: {}", response.message.content);

    Ok(())
}
```

Structured Outputs: Making LLM Responses Reliable
One of the fastest ways to break an agent is to rely on unstructured text where your application expects machine-readable state.
A lot of people start with prompts like:
Return the answer as JSON.
That works sometimes. Then it breaks at the worst possible time.
For agents, structured outputs matter because the model often needs to return:
- task state
- classifications
- extracted fields
- planner decisions
- routing choices
- tool-ready arguments
- status flags
- confidence indicators
This is why schema-enforced structured output is such a foundational primitive.
Instead of hoping the model returns the right JSON shape, you define the shape explicitly and enforce it.
```python
from ollama import chat
from pydantic import BaseModel

MODEL = "qwen3.5:9b"

# -----------------------------
# Schema
# -----------------------------
class ResearchResult(BaseModel):
    topic: str
    summary: str
    sources: list[str]
    confidence: float
    needs_more_research: bool

# -----------------------------
# Run query
# -----------------------------
response = chat(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a research assistant."},
        {"role": "user", "content": "Research the current state of MCP (Model Context Protocol)."},
    ],
    format=ResearchResult.model_json_schema(),
    options={"temperature": 0.2},
)

# -----------------------------
# Parse response
# -----------------------------
result = ResearchResult.model_validate_json(response.message.content)

print(f"Topic: {result.topic}")
print(f"Summary: {result.summary}")
print(f"Confidence: {result.confidence:.0%}")
print(f"Needs more research: {result.needs_more_research}")
print(f"Sources: {result.sources}")
```

And in Rust, enforcing the schema through the system prompt and serde:

```rust
use ollama_rs::{
    generation::chat::{request::ChatMessageRequest, ChatMessage},
    Ollama,
};
use serde::{Deserialize, Serialize};

#[derive(Debug, Deserialize, Serialize)]
struct ResearchResult {
    topic: String,
    summary: String,
    sources: Vec<String>,
    confidence: f32,
    needs_more_research: bool,
}

// -----------------------------
// System Prompt (force JSON)
// -----------------------------
const SYSTEM_PROMPT: &str = r#"You are a research assistant.

Return ONLY valid JSON matching this schema:

{
  "topic": string,
  "summary": string,
  "sources": string[],
  "confidence": number (0 to 1),
  "needs_more_research": boolean
}

Do not include any explanation or extra text."#;

pub async fn run() -> Result<(), Box<dyn std::error::Error>> {
    let ollama = Ollama::default();

    let messages = vec![
        ChatMessage::system(SYSTEM_PROMPT.to_string()),
        ChatMessage::user(
            "Research the current state of MCP (Model Context Protocol).".to_string(),
        ),
    ];

    let request = ChatMessageRequest::new("qwen3.5:9b".to_string(), messages);

    let response = ollama.send_chat_messages(request).await?;
    let content = response.message.content;

    // -----------------------------
    // Parse JSON → serde
    // -----------------------------
    let result: ResearchResult = serde_json::from_str(&content)?;

    println!("Topic: {}", result.topic);
    println!("Summary: {}", result.summary);
    println!("Confidence: {:.0}%", result.confidence * 100.0);
    println!("Needs more research: {}", result.needs_more_research);
    println!("Sources: {:?}", result.sources);

    Ok(())
}
```

Function Calling: How Agents Use Tools
An agent becomes much more useful when it can use tools.
But the model does not literally execute your code. Instead, it produces a structured request saying, in effect:
- which tool it wants to use
- what arguments it wants to pass
Then your application executes the tool and sends the result back into the conversation.
That mechanism is called function calling or tool calling.
This is one of the core primitives of modern agent systems because it lets the model move beyond text generation into interaction with the outside world.
Why tool calling matters
Tool calling lets the model:
- search for fresh information
- run calculations
- query APIs
- inspect databases
- execute code
- interact with files
- perform workflow steps
Without tool use, the model is limited to what is already in its parameters and prompt.
With tool use, it becomes part of a larger decision-and-action system.
Tool Schemas: Teaching the Model What Tools Exist
A tool is only as usable as its schema.
The model needs to know:
- the tool name
- what the tool does
- when it should be used
- what arguments it expects
- what those arguments mean
That information is defined in the tool schema.
We have already seen tool calling in action in earlier examples; the schema is what makes those calls possible, because it is the model's only description of each tool.
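As a sketch, here is what such a schema commonly looks like in the JSON-Schema-based format used by OpenAI-style and Ollama chat APIs. The tool name and its fields are hypothetical, invented for illustration.

```python
# A hypothetical weather tool, described in the JSON-Schema-based
# format that OpenAI-style and Ollama chat APIs expect. The model
# never executes this; it only reads the schema to decide when and
# how to request a call.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": (
            "Get the current weather for a city. "
            "Use this whenever the user asks about live weather conditions."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name, e.g. 'Berlin'",
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit to report in",
                },
            },
            "required": ["city"],
        },
    },
}
```

Notice how much of the schema is prose: the `description` fields are where you tell the model when the tool applies, which matters as much as the argument types.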
Prompting, Tools, and Context Work Together
By this point, the three-way relationship should be clear:
- Prompting defines how the model should behave
- Tools define what the model is allowed to do
- Context management defines what the model knows right now
These are not separate topics in production. They are tightly coupled.
A badly scoped prompt can cause bad tool use. A weak tool schema can cause wrong arguments. A noisy context can cause confused reasoning. A missing memory snippet can make the model act like it “forgot” something important.
That is why agent design is mostly about getting the interfaces between these pieces right.
The Practical Capability Stack for LLM-Based Agents
At this point, you can see that LLMs are not standalone systems. They are components inside a larger control loop.
1. Text enters the system → tokenized into tokens
2. Relevant knowledge is found → often through embeddings and retrieval
3. A prompt is built → system instructions + user input + state + retrieved context
4. The model responds → either with text, structured output, or a tool call
5. Your application executes the result → parse schema, run tool, update state, manage memory
6. The loop continues → until the task is complete

This is the real bridge between “LLM basics” and “agent primitives.” One is the mental model; the other is the implementation surface.
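The six steps above can be sketched as a minimal loop. The model and tool here are stubs with hypothetical names, standing in for real LLM and tool calls; the point is the control flow, not the components.

```python
import json

def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call: requests a tool once,
    # then returns a final answer after seeing the tool result.
    if "TOOL_RESULT" not in prompt:
        return json.dumps({"action": "tool", "tool": "search", "args": {"query": "MCP"}})
    return json.dumps({"action": "final", "answer": "done"})

def run_tool(tool: str, args: dict) -> str:
    # Stand-in for real tool execution (API call, DB query, etc.).
    return f"stub result for {tool}({args})"

def agent_loop(user_request: str, max_steps: int = 5) -> str:
    context = [f"USER: {user_request}"]              # working memory
    for _ in range(max_steps):
        prompt = "\n".join(context)                  # 3. build the prompt
        decision = json.loads(fake_model(prompt))    # 4. model responds (structured)
        if decision["action"] == "final":
            return decision["answer"]                # 6. task complete
        # 5. execute the tool and feed the observation back into context
        observation = run_tool(decision["tool"], decision["args"])
        context.append(f"TOOL_RESULT: {observation}")
    return "step limit reached"

print(agent_loop("Research MCP"))  # → done
```

Even this toy version shows the key invariants of real runtimes: the model only ever sees what is in `context`, every output is parsed as structured data, and a step cap bounds the loop.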
Wrapping Up
To build reliable agents, you need more than a vague idea of how LLMs work.
You need to understand:
- tokenization, because tokens are the real unit of cost and context
- embeddings, because retrieval and memory systems depend on semantic similarity
- context windows, because the model can only reason over what is currently present
- context management, because deciding what belongs in the prompt is your job
- working memory vs long context, because not all memory should be active at once
- prompt engineering, because prompts shape behavior
- system prompting with PAG, because production agents need identity, permissions, and boundaries
- structured outputs, because agents need reliable machine-readable state
- function calling and tool schemas, because agents act through tools, not just text
That is the foundation of modern LLM-based agent engineering.
LLMs are not magic.
They are constrained systems with:
- limited context
- probabilistic behavior
- structured interfaces
The difference between a demo and a production agent is how well you design around those constraints.
In the next article, we go one step deeper into how these primitives combine inside a full agent runtime: planner, executor, memory, state, and feedback loop.
→ Next up: Internal Agent Architecture