LLM Basics and Primitives for Agents
Most agents don’t fail because the model is weak. They fail because the builder doesn’t understand how the model actually works.
To build reliable agents, you need more than “LLMs generate text.” You need a mental model of how they represent information, handle context, and expose control surfaces.
In the previous article, we saw how reasoning models shift intelligence into inference-time compute.
This article focuses on the other side of that equation: the primitives you use to control, constrain, and integrate that intelligence into real systems.
Why Agent Builders Need an LLM Mental Model
An LLM is the reasoning and language engine inside many modern agents. It helps the system:
- interpret user intent
- reason about the next step
- decide whether to use a tool
- transform raw tool outputs into useful conclusions
- maintain coherent multi-step behavior
But the model is not magic. It has hard limits, strange failure modes, and specific interfaces. If you do not understand those, you end up building brittle agents that feel smart in demos and collapse in production.
A good agent builder needs to understand two layers:
1. Conceptual layer: how the model sees text, how it handles context, and how prompting shapes behavior.
2. Operational layer: the specific primitives you use to constrain outputs, wire tools into the loop, and manage memory.
We will cover both.
Tokenization: How LLMs Read Text
Computers do not read words directly. Before an LLM can process text, that text must be broken into smaller pieces called tokens.
A token is not always a whole word. Depending on the tokenizer, a token may be:
- a full word
- part of a word
- punctuation
- whitespace-prefixed text
- sometimes even a short character sequence
For example, a sentence like:

Hello, world!

might be split into tokens like:

["Hello", ",", " world", "!"]

And a more complex word might be split into subword units rather than one whole token.
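As a rough illustration, you can approximate this splitting behavior in a few lines. This is a toy splitter, not a real BPE tokenizer; production tokenizers learn their vocabulary from data and split differently.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    # Toy approximation: split into words and punctuation, keeping the
    # leading space attached to words the way many real tokenizers do
    # ("Hello", ",", " world", "!"). Real BPE tokenizers are learned.
    return re.findall(r" ?\w+|[^\w\s]", text)

tokens = toy_tokenize("Hello, world!")
print(tokens)  # ['Hello', ',', ' world', '!']
```

Even this crude version shows why token counts diverge from word counts: punctuation and whitespace handling produce more (or fewer) tokens than you might guess.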
This matters because tokens are the real unit of computation for LLMs.
Everything is measured in tokens:
- input size
- output size
- context window usage
- latency
- cost
For agents, this matters even more because agents often work with:
- long user conversations
- tool outputs
- documents
- logs
- retrieved chunks from memory systems
- structured state
A badly designed agent can waste huge amounts of tokens on irrelevant context.
Embeddings: How LLM Systems Represent Meaning
Once text is tokenized, it must be represented numerically. One important representation used throughout LLM systems is the embedding.
An embedding is a dense vector that captures semantic information. You can think of it as a point in a high-dimensional space where semantically similar things lie closer together.
For example, embeddings for:
- “doctor” and “physician”
- “car” and “vehicle”
- “Rust async programming” and “Tokio tutorial”
will tend to be closer than unrelated phrases.
Embeddings are especially important in agentic systems because they power capabilities such as:
- semantic search
- retrieval-augmented generation
- memory lookup
- document matching
- intent clustering
- relevance ranking
So while the LLM itself generates language token by token, the broader agent system often uses embeddings to decide what information should be brought into the model’s context in the first place.
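Similarity between embeddings is usually measured with cosine similarity. Here is a minimal sketch using made-up 4-dimensional vectors; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a . b) / (|a| * |b|): 1.0 means identical direction,
    # values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, hand-made for illustration only.
doctor    = [0.90, 0.10, 0.00, 0.20]
physician = [0.85, 0.15, 0.05, 0.25]
car       = [0.00, 0.80, 0.60, 0.10]

print(cosine_similarity(doctor, physician) > cosine_similarity(doctor, car))  # True
```

Retrieval systems run exactly this comparison (at scale, with approximate nearest-neighbor indexes) to decide which stored chunks are relevant to the current query.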
That is one of the key mental models for agent design: the model reasons only over what is in its context, while the surrounding system is responsible for selecting what goes there.
Context: Why LLMs Need Help Remembering
An LLM does not have memory in the human sense. It does not continuously remember your conversation unless the relevant information is included in the current input.
This leads to one of the most important ideas in agent engineering:
Context Window
The context window is the maximum number of tokens the model can process in one call, including both input and output.
That limit is a hard technical bound. No matter how smart the model is, it cannot directly reason over text that is not inside the current context for that call.
If the important information is absent, the model cannot use it.
Context Management
Context management is the engineering problem of deciding what should go into that limited window.
This is where agent design becomes real engineering rather than just prompting.
You must decide:
- what past conversation to include
- what tool results to keep
- what retrieved documents matter
- what state must be explicitly represented
- what old information should be removed or summarized
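One common selection strategy is a token budget: always keep the system prompt, then add messages from newest to oldest until the budget runs out. Below is a minimal sketch, assuming a crude length-based token estimate; a real system would use the model's actual tokenizer and likely summarize dropped history rather than discard it.

```python
def estimate_tokens(text: str) -> int:
    # Crude proxy (~4 characters per token); real systems should use
    # the model's own tokenizer for accurate counts.
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system prompt plus the most recent messages that fit."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    # Walk backwards from the newest message, adding while budget allows.
    for m in reversed(rest):
        cost = estimate_tokens(m["content"])
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))

history = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "x" * 400},  # old, oversized message
    {"role": "user", "content": "latest question"},
]
trimmed = trim_to_budget(history, budget=50)
print([m["role"] for m in trimmed])  # ['system', 'user']
```

Note the design choice: recency wins. More sophisticated strategies score messages by relevance to the current step instead of just age.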
These are separate concepts:
| | Context Window | Context Management |
|---|---|---|
| What it is | Hard model limit | Your design strategy |
| Who controls it | Model provider | You |
| Nature | Technical constraint | System architecture problem |
| Question | “How much fits?” | “What is worth putting in?” |
A lot of beginners confuse these. Bigger context windows help, but they do not solve the core design problem.
Bigger Context Is Not a Free Win
Modern models can support very large context windows. That sounds amazing, but in practice, “just stuff everything into the prompt” is often the wrong strategy.
Why?
Attention dilution
When context gets very large, important details can get lost among irrelevant ones. The model may ignore or misweight the information that actually matters.
Cost explosion
Tokens cost money. More context means more spend, often dramatically more.
Latency
Large prompts take longer to process. Multi-step agents already have loop overhead, so oversized context can make them painfully slow.
Noise accumulation
If you keep dumping tool outputs, stale plans, irrelevant history, and half-useful retrieval chunks into the prompt, the model gets a noisier picture of the task.
That is why good agents do retrieval and selection, not blind stuffing.
Working Memory vs Long Context
This is one of the most useful mental models for agent builders.
| | Working Memory | Long Context |
|---|---|---|
| Analogy | RAM | External storage |
| In agents | What is in the prompt right now | Information stored outside the prompt |
| Speed | Instantly available to the model | Requires retrieval |
| Limit | Hard token cap | Effectively much larger |
| Best for | Current reasoning state | Background knowledge and history |
Working memory
Working memory is the active prompt. It contains things like:
- system prompt
- current user request
- latest tool outputs
- current plan
- selected memory snippets
- structured task state
This is the information the model can directly reason over right now.
Long context
Long context refers to information outside the current prompt, such as:
- vector database contents
- documentation store
- past session history
- knowledge base
- user preferences
- archived task traces
That information is useful, but it is not active until you retrieve the relevant parts and place them into working memory.
```
┌──────────────────────────────────────────┐
│ WORKING MEMORY (active context window)   │
│ • System prompt                          │
│ • Current user request                   │
│ • Recent tool results                    │
│ • Current plan/state                     │
│ • Retrieved snippet for this step        │
└──────────────────────────────────────────┘
                     ▲
                     │ retrieve relevant
                     │ information
┌──────────────────────────────────────────┐
│ LONG CONTEXT (external memory)           │
│ • Past sessions                          │
│ • Documentation                          │
│ • Knowledge base                         │
│ • User preferences                       │
│ • Archived observations                  │
└──────────────────────────────────────────┘
```

The job of the agent system is not merely to “have memory.” It is to retrieve the right memory at the right step.
Prompt Engineering: How You Steer the Model
If context determines what the model sees, prompting determines how it behaves.
Prompt engineering is the practice of designing model inputs so that the model produces more useful, reliable outputs.
At a conceptual level, prompting helps with:
- shaping behavior
- constraining task scope
- providing instructions
- encouraging reasoning style
- specifying format
- defining success criteria
Classic prompting patterns include:
Zero-shot prompting
You tell the model what to do without examples.
Example:
Classify this email as spam or not spam.

Few-shot prompting
You provide a few examples to teach the pattern.
This is useful when the task is subtle and examples communicate the expected behavior more clearly than plain instructions.
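As a sketch, a few-shot prompt is often expressed as alternating chat messages, where each user/assistant pair demonstrates the expected input-to-label pattern. The examples and labels below are invented purely for illustration.

```python
# A few-shot classification prompt expressed as chat messages.
# Each example pair teaches the model the expected pattern.
few_shot_messages = [
    {"role": "system", "content": "Classify each email as 'spam' or 'not spam'. Answer with the label only."},
    {"role": "user", "content": "You have WON a FREE cruise! Click now!"},
    {"role": "assistant", "content": "spam"},
    {"role": "user", "content": "Hi, attaching the meeting notes from Tuesday."},
    {"role": "assistant", "content": "not spam"},
    # The real query goes last; the model continues the demonstrated pattern.
    {"role": "user", "content": "Limited-time offer: act before midnight!!!"},
]
```

Because the examples occupy context tokens, few-shot prompting trades context budget for behavioral precision; use just enough examples to pin down the pattern.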
Chain-of-thought style prompting
You encourage the model to reason through a problem step by step.
This can improve performance on tasks that require decomposition, though in production systems you often want the benefits of reasoning without exposing unnecessary internal reasoning in user-facing outputs.
Role prompting
You assign the model a role such as researcher, tutor, analyst, or coding assistant.
This can help stabilize tone, behavior, and assumptions.
For agents, prompting is not just about getting a nicer answer. It is often how you define the behavior of the entire agent loop.
System Prompting in Production: The PAG Framework
In real agents, the most important prompt is usually the system prompt.
A production system prompt is not just a vibe-setting sentence like “you are a helpful assistant.” It is closer to an operational contract.
A useful framework here is:
PAG = Persona + Action + Guardrail
| Dimension | What It Defines | Example |
|---|---|---|
| Persona | Who the agent is | “You are a senior Rust systems engineer.” |
| Action | What the agent is supposed to do | “You help debug Rust code and may use documentation tools.” |
| Guardrail | What it must not do | “Do not fabricate APIs. If unsure, say so.” |
This structure is simple, but it captures the three things that matter most:
- identity
- permissions
- boundaries
Here is that structure as a runnable example against a local Ollama model:

```python
from ollama import chat
from ollama import ChatResponse

# -----------------------------
# Ollama details
# -----------------------------
MODEL = "qwen3.5:9b"

# -----------------------------
# System Persona & Guardrails
# -----------------------------
SYSTEM_PROMPT = """## Persona
You are Ferris, an expert AI assistant specializing in Rust programming and
systems engineering. You are precise, concise, and prefer working examples
over abstract explanations.

## Action
You help developers with:
- Writing and debugging Rust code
- Explaining Rust concepts
- Recommending crates from the Rust ecosystem
You may search documentation and run code examples to verify answers.

## Guardrail
- Do not answer questions outside Rust and systems programming.
- If unsure, say so explicitly.
- Never fabricate crate names or API signatures.
- Do not execute code that modifies the filesystem."""


def run_rust_expert(user_query: str):
    response: ChatResponse = chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
        options={"temperature": 0.2},
        stream=False,
    )
    return response["message"]["content"]


if __name__ == "__main__":
    query = "Please explain string slices in Rust?"

    print("Querying Ferris...\n")
    answer = run_rust_expert(query)

    print("-" * 30)
    print("Ferris's Response:")
    print(answer)
```

The same pattern in Rust, using the ollama-rs crate:

```rust
use ollama_rs::{
    generation::chat::{request::ChatMessageRequest, ChatMessage},
    Ollama,
};

// -----------------------------
// System Persona & Guardrails
// -----------------------------
const SYSTEM_PROMPT: &str = r#"## Persona
You are Ferris, an expert AI assistant specializing in Rust programming and
systems engineering. You are precise, concise, and prefer working examples
over abstract explanations.

## Action
You help developers with:
- Writing and debugging Rust code
- Explaining Rust concepts
- Recommending crates from the Rust ecosystem

## Guardrail
- Do not answer questions outside Rust and systems programming.
- If unsure, say so explicitly.
- Never fabricate crate names or API signatures.
- Do not execute code that modifies the filesystem."#;

pub async fn run() -> Result<(), Box<dyn std::error::Error>> {
    let ollama = Ollama::default();

    let prompt = "What's the best crate for async HTTP in Rust?";
    println!("User: {}\n", prompt);

    // Build the message sequence: system prompt first, then the user turn
    let messages = vec![
        ChatMessage::system(SYSTEM_PROMPT.to_string()),
        ChatMessage::user(prompt.to_string()),
    ];

    // Build the request object for the chosen model
    let request = ChatMessageRequest::new("qwen3.5:9b".to_string(), messages);

    let response = ollama.send_chat_messages(request).await?;

    println!("Ferris: {}", response.message.content);

    Ok(())
}
```

Structured Outputs: Making LLM Responses Reliable
One of the fastest ways to break an agent is to rely on unstructured text where your application expects machine-readable state.
A lot of people start with prompts like:
Return the answer as JSON.
That works sometimes. Then it breaks at the worst possible time.
For agents, structured outputs matter because the model often needs to return:
- task state
- classifications
- extracted fields
- planner decisions
- routing choices
- tool-ready arguments
- status flags
- confidence indicators
This is why schema-enforced structured output is such a foundational primitive.
Instead of hoping the model returns the right JSON shape, you define the shape explicitly and enforce it.
```python
from ollama import chat
from pydantic import BaseModel

MODEL = "qwen3.5:9b"

# -----------------------------
# Schema
# -----------------------------
class ResearchResult(BaseModel):
    topic: str
    summary: str
    sources: list[str]
    confidence: float
    needs_more_research: bool

# -----------------------------
# Run query
# -----------------------------
response = chat(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a research assistant."},
        {"role": "user", "content": "Research the current state of MCP (Model Context Protocol)."},
    ],
    format=ResearchResult.model_json_schema(),
    options={"temperature": 0.2},
)

# -----------------------------
# Parse response
# -----------------------------
result = ResearchResult.model_validate_json(response.message.content)

print(f"Topic: {result.topic}")
print(f"Summary: {result.summary}")
print(f"Confidence: {result.confidence:.0%}")
print(f"Needs more research: {result.needs_more_research}")
print(f"Sources: {result.sources}")
```

And in Rust, enforcing the schema through the system prompt and serde:

```rust
use ollama_rs::{
    generation::chat::{request::ChatMessageRequest, ChatMessage},
    Ollama,
};
use serde::{Deserialize, Serialize};

#[derive(Debug, Deserialize, Serialize)]
struct ResearchResult {
    topic: String,
    summary: String,
    sources: Vec<String>,
    confidence: f32,
    needs_more_research: bool,
}

// -----------------------------
// System Prompt (force JSON)
// -----------------------------
const SYSTEM_PROMPT: &str = r#"You are a research assistant.

Return ONLY valid JSON matching this schema:

{
  "topic": string,
  "summary": string,
  "sources": string[],
  "confidence": number (0 to 1),
  "needs_more_research": boolean
}

Do not include any explanation or extra text."#;

pub async fn run() -> Result<(), Box<dyn std::error::Error>> {
    let ollama = Ollama::default();

    let messages = vec![
        ChatMessage::system(SYSTEM_PROMPT.to_string()),
        ChatMessage::user(
            "Research the current state of MCP (Model Context Protocol).".to_string(),
        ),
    ];

    let request = ChatMessageRequest::new("qwen3.5:9b".to_string(), messages);

    let response = ollama.send_chat_messages(request).await?;
    let content = response.message.content;

    // -----------------------------
    // Parse JSON → serde
    // -----------------------------
    let result: ResearchResult = serde_json::from_str(&content)?;

    println!("Topic: {}", result.topic);
    println!("Summary: {}", result.summary);
    println!("Confidence: {:.0}%", result.confidence * 100.0);
    println!("Needs more research: {}", result.needs_more_research);
    println!("Sources: {:?}", result.sources);

    Ok(())
}
```

Function Calling: How Agents Use Tools
An agent becomes much more useful when it can use tools.
But the model does not literally execute your code. Instead, it produces a structured request saying, in effect:
- which tool it wants to use
- what arguments it wants to pass
Then your application executes the tool and sends the result back into the conversation.
That mechanism is called function calling or tool calling.
This is one of the core primitives of modern agent systems because it lets the model move beyond text generation into interaction with the outside world.
Why tool calling matters
Tool calling lets the model:
- search for fresh information
- run calculations
- query APIs
- inspect databases
- execute code
- interact with files
- perform workflow steps
Without tool use, the model is limited to what is already in its parameters and prompt.
With tool use, it becomes part of a larger decision-and-action system.
Tool Schemas: Teaching the Model What Tools Exist
A tool is only as usable as its schema.
The model needs to know:
- the tool name
- what the tool does
- when it should be used
- what arguments it expects
- what those arguments mean
That information is defined in the tool schema.
We have already seen tool calling in action in earlier examples; the schema is what makes those calls possible, because it is the model's only description of each tool.
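As a sketch, here is what such a schema commonly looks like in the JSON-Schema-based format used by OpenAI-style and Ollama chat APIs. The tool name and its fields are hypothetical, invented for illustration.

```python
# A hypothetical weather tool, described in the JSON-Schema-based
# format that OpenAI-style and Ollama chat APIs expect. The model
# never executes this; it only reads the schema to decide when and
# how to request a call.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": (
            "Get the current weather for a city. "
            "Use this whenever the user asks about live weather conditions."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name, e.g. 'Berlin'",
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit to report in",
                },
            },
            "required": ["city"],
        },
    },
}
```

Notice how much of the schema is prose: the `description` fields are where you tell the model when the tool applies, which matters as much as the argument types.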
Prompting, Tools, and Context Work Together
By this point, the three-way relationship should be clear:
- Prompting defines how the model should behave
- Tools define what the model is allowed to do
- Context management defines what the model knows right now
These are not separate topics in production. They are tightly coupled.
A badly scoped prompt can cause bad tool use. A weak tool schema can cause wrong arguments. A noisy context can cause confused reasoning. A missing memory snippet can make the model act like it “forgot” something important.
That is why agent design is mostly about getting the interfaces between these pieces right.
The Practical Capability Stack for LLM-Based Agents
At this point, you can see that LLMs are not standalone systems. They are components inside a larger control loop.
1. Text enters the system → tokenized into tokens
2. Relevant knowledge is found → often through embeddings and retrieval
3. A prompt is built → system instructions + user input + state + retrieved context
4. The model responds → either with text, structured output, or a tool call
5. Your application executes the result → parse schema, run tool, update state, manage memory
6. The loop continues → until the task is complete

This is the real bridge between “LLM basics” and “agent primitives.” One is the mental model; the other is the implementation surface.
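The six steps above can be sketched as a minimal loop. The model and tool here are stubs with hypothetical names, standing in for real LLM and tool calls; the point is the control flow, not the components.

```python
import json

def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call: requests a tool once,
    # then returns a final answer after seeing the tool result.
    if "TOOL_RESULT" not in prompt:
        return json.dumps({"action": "tool", "tool": "search", "args": {"query": "MCP"}})
    return json.dumps({"action": "final", "answer": "done"})

def run_tool(tool: str, args: dict) -> str:
    # Stand-in for real tool execution (API call, DB query, etc.).
    return f"stub result for {tool}({args})"

def agent_loop(user_request: str, max_steps: int = 5) -> str:
    context = [f"USER: {user_request}"]              # working memory
    for _ in range(max_steps):
        prompt = "\n".join(context)                  # 3. build the prompt
        decision = json.loads(fake_model(prompt))    # 4. model responds (structured)
        if decision["action"] == "final":
            return decision["answer"]                # 6. task complete
        # 5. execute the tool and feed the observation back into context
        observation = run_tool(decision["tool"], decision["args"])
        context.append(f"TOOL_RESULT: {observation}")
    return "step limit reached"

print(agent_loop("Research MCP"))  # → done
```

Even this toy version shows the key invariants of real runtimes: the model only ever sees what is in `context`, every output is parsed as structured data, and a step cap bounds the loop.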
Wrapping Up
To build reliable agents, you need more than a vague idea of how LLMs work.
You need to understand:
- tokenization, because tokens are the real unit of cost and context
- embeddings, because retrieval and memory systems depend on semantic similarity
- context windows, because the model can only reason over what is currently present
- context management, because deciding what belongs in the prompt is your job
- working memory vs long context, because not all memory should be active at once
- prompt engineering, because prompts shape behavior
- system prompting with PAG, because production agents need identity, permissions, and boundaries
- structured outputs, because agents need reliable machine-readable state
- function calling and tool schemas, because agents act through tools, not just text
That is the foundation of modern LLM-based agent engineering.
LLMs are not magic.
They are constrained systems with:
- limited context
- probabilistic behavior
- structured interfaces
The difference between a demo and a production agent is how well you design around those constraints.
In the next article, we go one step deeper into how these primitives combine inside a full agent runtime: planner, executor, memory, state, and feedback loop.
→ Next up: Internal Agent Architecture