How Large Language Models Work
Table of Contents
- Overview
- The Four-Stage Pipeline
- Complete Flow Example
- Key Parameters That Affect Output
- Model Architecture Components
- Training vs Inference
- Limitations and Considerations
- Optimization Techniques
- How AI Agents Use LLMs
- Basic Agent Architecture
- The Agent Loop (ReAct Pattern)
- Key Components of an Agent
- How Agents Extend LLM Capabilities
- Agent Execution Flow
- Types of Agent Architectures
- Tool Calling Formats
- Agent Memory Systems
- Error Handling and Retries
- Agent vs Pure LLM
- Real-World Agent Example
- Best Practices for Agent Design
- Agent Limitations
- Summary
Overview
This guide explains the internal process of how a Large Language Model (LLM) transforms your text prompt into a coherent answer.
The Four-Stage Pipeline
1. Prompt Text → Tokenization
What happens:
- Your input text is broken down into smaller units called "tokens"
- Tokens can be words, subwords, or even individual characters
- Each token is converted to a numerical ID from the model's vocabulary
Example:
Input: "How do neural networks learn?"
Tokens: ["How", " do", " neural", " networks", " learn", "?"]
Token IDs: [2437, 466, 17019, 7686, 2193, 30]
Key concepts:
- Vocabulary size: typically 32k-100k+ tokens
- Special tokens: <start>, <end>, <pad> for structure
- Subword tokenization: handles rare words by breaking them into parts
- Token limit: models have maximum context windows (e.g., 4k, 8k, 128k tokens)
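To see this in practice, here is a quick sketch using OpenAI's open-source tiktoken tokenizer (one tokenizer among many; token boundaries and IDs vary by model, and will differ from the illustrative values above):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one widely used byte-pair-encoding vocabulary;
# every model family ships its own.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("How do neural networks learn?")
print(ids)                             # integer token IDs
print([enc.decode([i]) for i in ids])  # the text piece behind each ID
print(enc.decode(ids))                 # round-trips to the original string
```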
2. Token Processing → Embeddings
What happens:
- Each token ID is converted into a high-dimensional vector (embedding)
- Embeddings capture semantic meaning in numerical space
- Position encodings are added to preserve word order
Example:
Token "neural" → [0.23, -0.45, 0.67, ..., 0.12] (768 dimensions)
Token "network" → [0.19, -0.41, 0.71, ..., 0.09] (768 dimensions)
Key concepts:
- Embedding dimension: typically 768-12,288 dimensions
- Similar words have similar vector representations
- Position matters: "dog bites man" ≠ "man bites dog"
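A toy NumPy sketch of this stage, with randomly initialized weights standing in for the learned embedding table (the sinusoidal position encoding follows the original Transformer recipe; many modern models learn positions or use rotary embeddings instead):

```python
import numpy as np

vocab_size, d_model, seq_len = 50_000, 768, 6
rng = np.random.default_rng(0)

# Toy embedding table; in a real model these weights are learned.
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([2437, 466, 17019, 7686, 2193, 30])
token_vecs = embedding_table[token_ids]           # (6, 768) lookup

# Sinusoidal position encoding: sin on even dims, cos on odd dims.
pos = np.arange(seq_len)[:, None]
dim = np.arange(d_model)[None, :]
angle = pos / np.power(10_000, (2 * (dim // 2)) / d_model)
pos_enc = np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

x = token_vecs + pos_enc                          # input to the first layer
print(x.shape)                                    # (6, 768)
```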
3. Reasoning → Transformer Layers
What happens:
- Embeddings flow through multiple transformer layers (12-96+ layers)
- Each layer performs self-attention and feed-forward operations
- The model identifies patterns, relationships, and context
Self-Attention Mechanism:
For each token:
1. Look at all other tokens in the context
2. Calculate relevance scores (attention weights)
3. Combine information from relevant tokens
4. Update the token's representation
Example attention pattern:
Input: "The cat sat on the mat because it was comfortable"
Token "it" attends strongly to → "mat" (or "cat")
Layer-by-layer processing:
- Early layers: syntax, grammar, basic patterns
- Middle layers: semantic relationships, entity recognition
- Late layers: abstract reasoning, task-specific logic
Key concepts:
- Multi-head attention: parallel attention mechanisms (8-96 heads)
- Residual connections: preserve information across layers
- Layer normalization: stabilize training and inference
- Feed-forward networks: non-linear transformations
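The attention computation itself is compact. Here is a hedged NumPy sketch of a single head, omitting the causal mask, multi-head splitting, residual connections, and normalization that a real decoder layer adds:

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a token sequence x."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # mix info from relevant tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 64))                       # 10 tokens, 64-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(64, 64)) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)       # (10, 64)
```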
4. Text Answer Building → Decoding
What happens:
- The final layer outputs a probability distribution over the vocabulary
- The model selects the next token according to a sampling strategy
- The process repeats autoregressively until completion
Decoding strategies:
Greedy decoding:
Always pick the highest probability token
→ Deterministic but sometimes repetitive
Temperature sampling:
temperature = 0.0 → deterministic (always most likely)
temperature = 0.7 → balanced creativity
temperature = 1.5 → very creative/random
Top-k sampling:
Consider only the k most likely tokens (e.g., k=40)
Sample from this restricted set
Top-p (nucleus) sampling:
Consider tokens until cumulative probability reaches p (e.g., p=0.9)
More dynamic than top-k
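All four strategies reduce to a few lines over the model's output distribution. A hedged NumPy sketch with toy logits (a real model would supply logits over its full vocabulary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Pick the next token ID from raw logits using the strategies above."""
    if temperature == 0.0:                        # greedy decoding
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    if top_k is not None:                         # keep only the k most likely
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                         # smallest set with mass >= p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        kept = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[kept] = 1.0
        probs *= mask
    probs /= probs.sum()                          # renormalize, then sample
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.5, 0.3, -1.0])          # toy 4-token vocabulary
print(sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9))
```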
Example generation:
Prompt: "The capital of France is"
Step 1: Model outputs → " Paris" (95% probability)
Step 2: Model outputs → "," (60% probability)
Step 3: Model outputs → " which" (45% probability)
...continues until <end> token or max length
Complete Flow Example
Input Prompt:
"Explain photosynthesis in simple terms"
Step-by-step process:
1. Tokenization:
["Explain", " photo", "synthesis", " in", " simple", " terms"]
→ [8849, 5052, 48935, 287, 2829, 2846]
2. Embedding:
Each token → 768-dimensional vector
+ positional encoding (token 0, 1, 2, ...)
3. Transformer processing (simplified):
Layer 1: Recognizes "Explain" is a request
Layer 5: Understands "photosynthesis" is a biological process
Layer 10: Connects "simple terms" → need for accessible explanation
Layer 15: Activates knowledge about plants, sunlight, energy
Layer 20: Formulates explanation structure
4. Generation:
Token 1: "Photo" (start of answer)
Token 2: "synthesis"
Token 3: " is"
Token 4: " the"
Token 5: " process"
...
(continues until complete answer)
Key Parameters That Affect Output
Temperature (0.0 - 2.0)
- Controls randomness in token selection
- Lower = more focused and deterministic
- Higher = more creative and diverse
Top-p / Top-k
- Limits the token selection pool
- Prevents very unlikely tokens from being chosen
- Balances coherence and creativity
Max tokens
- Maximum length of generated response
- Prevents infinite generation
- Typical values: 256, 512, 2048, 4096
Frequency penalty
- Reduces repetition of tokens
- Positive values discourage repeated words
- Range: -2.0 to 2.0
Presence penalty
- Encourages topic diversity
- Positive values encourage new topics
- Range: -2.0 to 2.0
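These names follow the convention popularized by the OpenAI API; other providers expose the same knobs under slightly different names. A typical request body looks like this (the model name is illustrative):

```python
import json

request_body = {
    "model": "example-model",           # illustrative, not a real model ID
    "messages": [{"role": "user", "content": "Explain photosynthesis simply."}],
    "temperature": 0.7,                 # randomness in token selection
    "top_p": 0.9,                       # nucleus sampling threshold
    "max_tokens": 512,                  # cap on generated length
    "frequency_penalty": 0.5,           # discourage repeated tokens
    "presence_penalty": 0.3,            # encourage new topics
}
print(json.dumps(request_body, indent=2))
```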
Model Architecture Components
Core elements:
- Token embeddings: Convert IDs to vectors
- Position embeddings: Encode sequence order
- Attention layers: Identify relationships between tokens
- Feed-forward layers: Transform representations
- Layer normalization: Stabilize activations
- Output projection: Convert to vocabulary probabilities
Model sizes:
- Small: 125M-1B parameters (fast, less capable)
- Medium: 7B-13B parameters (balanced)
- Large: 30B-70B parameters (very capable)
- Extra large: 175B-1T+ parameters (most capable, slower)
Training vs Inference
Training (how models learn):
- Process billions of text examples
- Predict next token, compare to actual
- Adjust weights to minimize prediction error
- Takes weeks/months on massive GPU clusters
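The heart of that loop is the next-token prediction objective: cross-entropy between the model's predicted distribution and the token that actually came next. A toy NumPy illustration for a single position (random logits stand in for a real model's output):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 10

# Toy predicted distribution over the vocabulary for one position.
logits = rng.normal(size=vocab_size)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

actual_next_token = 3                      # the token that really followed
loss = -np.log(probs[actual_next_token])   # cross-entropy at this position
print(f"loss = {loss:.3f}")                # training adjusts weights to lower this
```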
Inference (how models respond):
- Use frozen (fixed) weights
- Process your prompt through the network
- Generate tokens one at a time
- Takes seconds to minutes depending on length
Limitations and Considerations
Context window:
- Models can only "see" a limited number of tokens
- Older information may be forgotten in long conversations
- Context window sizes: 4k, 8k, 32k, 128k+ tokens
Knowledge cutoff:
- Models only know information from their training data
- No real-time information unless connected to external tools
- May have outdated information
Hallucinations:
- Models can generate plausible but incorrect information
- Confidence doesn't equal accuracy
- Always verify critical information
Reasoning limitations:
- Pattern matching, not true understanding
- Can struggle with complex logic or math
- May miss subtle context or nuance
Optimization Techniques
Quantization:
- Reduce precision of weights (32-bit → 8-bit or 4-bit)
- Smaller memory footprint, faster inference
- Slight quality trade-off
Caching:
- Store computed values for repeated prompts
- Speeds up multi-turn conversations
- Key-value cache for attention mechanism
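The key-value cache works because, during autoregressive generation, the K and V projections of earlier tokens never change; only the newest token's query must attend over them. A minimal NumPy sketch of the idea:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
K_cache, V_cache = [], []               # grows by one entry per generated token

def attend_new_token(x_new, W_q, W_k, W_v):
    """Attention output for the newest token only, reusing cached K/V."""
    K_cache.append(x_new @ W_k)         # cache instead of recomputing history
    V_cache.append(x_new @ W_v)
    q = x_new @ W_q
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
for _ in range(5):                      # five decoding steps
    out = attend_new_token(rng.normal(size=d), W_q, W_k, W_v)
print(len(K_cache), out.shape)          # 5 cached entries, (64,) output
```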
Batching:
- Process multiple requests simultaneously
- Better GPU utilization
- Higher throughput
How AI Agents Use LLMs
An AI agent is a system that uses an LLM as its "brain" but extends it with additional capabilities like tool use, memory, and planning. Here's how agents work:
Basic Agent Architecture
User Request
↓
Agent System (orchestration layer)
↓
┌─────────────────────────────────────┐
│ LLM (reasoning engine) │
│ - Understands request │
│ - Plans actions │
│ - Decides what tools to use │
└─────────────────────────────────────┘
↓
Tool Execution (external actions)
↓
Results fed back to LLM
↓
Final Response to User
The Agent Loop (ReAct Pattern)
Agents typically follow a Thought → Action → Observation cycle:
Example: "What's the weather in Paris and convert the temperature to Celsius?"
Iteration 1:
Thought: "I need to get the current weather in Paris"
Action: call_tool("get_weather", {"city": "Paris"})
Observation: "Temperature: 72°F, Sunny"
Iteration 2:
Thought: "I have the temperature in Fahrenheit, need to convert to Celsius"
Action: call_tool("convert_temperature", {"value": 72, "from": "F", "to": "C"})
Observation: "22.2°C"
Iteration 3:
Thought: "I have all the information needed"
Action: respond_to_user
Response: "The weather in Paris is sunny with a temperature of 22.2°C (72°F)."
Key Components of an Agent
1. System Prompt (Instructions)
You are an AI assistant with access to tools.
When you need information, use the available tools.
Always explain your reasoning before taking action.
Available tools:
- search_web(query): Search the internet
- read_file(path): Read a file
- execute_code(code): Run Python code
2. Tool Definitions
{
  "name": "search_web",
  "description": "Search the internet for current information",
  "parameters": {
    "query": "string - the search query"
  }
}
3. Conversation Memory
[Previous messages]
User: "Find the population of Tokyo"
Assistant: [used search_web] "Tokyo has 14 million people"
User: "What about Paris?"
Assistant: [remembers context] [uses search_web] "Paris has 2.1 million people"
How Agents Extend LLM Capabilities
| Limitation | How Agents Solve It |
|---|---|
| No real-time data | Connect to APIs, databases, search engines |
| Can't perform actions | Execute code, modify files, send emails |
| Limited memory | Store conversation history, use vector databases |
| No access to private data | Read from user's files, databases, documents |
| Can't verify facts | Use tools to check information, run calculations |
Agent Execution Flow
Step 1: Prompt Construction
System Instructions
+
Tool Definitions
+
Conversation History
+
User Request
→ Sent to LLM
Step 2: LLM Response Parsing
LLM Output: "I need to search for information.
<tool_call>search_web("Paris weather")</tool_call>"
Agent parses this and extracts:
- Tool name: search_web
- Parameters: {"query": "Paris weather"}
Step 3: Tool Execution
Agent executes: search_web("Paris weather")
Result: "Current weather in Paris: 22°C, Sunny"
Step 4: Result Injection
Agent adds result to context:
"Tool result: Current weather in Paris: 22°C, Sunny"
→ Sends back to LLM for next decision
Step 5: Iteration or Completion
LLM decides:
- Need more tools? → Repeat cycle
- Have enough info? → Generate final response
Types of Agent Architectures
1. ReAct (Reasoning + Acting)
- LLM reasons about what to do
- Executes actions via tools
- Observes results and continues

2. Plan-and-Execute
- LLM creates a complete plan first
- Agent executes all steps
- Less flexible but more predictable

3. Autonomous Agents
- Given high-level goals
- Continuously run until goal achieved
- Can spawn sub-tasks

4. Multi-Agent Systems
- Multiple specialized agents
- Each has different tools/expertise
- Collaborate to solve complex tasks
Tool Calling Formats
Function Calling (Structured)
{
  "tool": "get_weather",
  "arguments": {
    "city": "Paris",
    "units": "celsius"
  }
}
Natural Language (Parsed)
I'll use the weather tool to check Paris.
ACTION: get_weather(city="Paris", units="celsius")
XML Format
<tool_call>
  <name>get_weather</name>
  <parameters>
    <city>Paris</city>
    <units>celsius</units>
  </parameters>
</tool_call>
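Whichever format is used, the agent needs a reliable parser. Here is a sketch for the XML format above using Python's standard library (a production agent would also validate arguments against the tool's schema):

```python
import xml.etree.ElementTree as ET

llm_output = """<tool_call>
  <name>get_weather</name>
  <parameters>
    <city>Paris</city>
    <units>celsius</units>
  </parameters>
</tool_call>"""

root = ET.fromstring(llm_output)
tool_name = root.findtext("name")
params = {child.tag: child.text for child in root.find("parameters")}
print(tool_name, params)   # get_weather {'city': 'Paris', 'units': 'celsius'}
```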
Agent Memory Systems
Short-term Memory:
- Current conversation context
- Recent tool results
- Stored in prompt/context window

Long-term Memory:
- Vector database for semantic search
- Key-value stores for facts
- Retrieved when relevant
Example:
User: "Remember my favorite color is blue"
→ Agent stores: {"user_preference": "favorite_color", "value": "blue"}
Later...
User: "What color should I paint my room?"
→ Agent retrieves: "favorite_color = blue"
→ Response: "Since your favorite color is blue, you might consider..."
Error Handling and Retries
Agents handle failures that LLMs alone cannot:
Attempt 1: call_tool("search", {"query": ""})
Error: "Query cannot be empty"
Agent injects error into context:
"Tool error: Query cannot be empty. Please provide a valid query."
LLM adjusts:
Attempt 2: call_tool("search", {"query": "Paris weather"})
Success: Returns weather data
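A sketch of that retry pattern: the agent catches the tool error, feeds it back into context as an observation, and asks again (the search tool and the "LLM" proposing arguments are both stubs):

```python
def search(query):
    """Stub tool that rejects empty queries, mirroring the example above."""
    if not query:
        raise ValueError("Query cannot be empty")
    return f"Results for {query!r}"

def run_tool_with_retry(history, propose_args, max_retries=3):
    """Execute the tool, injecting errors back into context on failure."""
    for attempt in range(1, max_retries + 1):
        args = propose_args(history)      # stand-in for an LLM call
        try:
            result = search(**args)
            history.append(f"Success (attempt {attempt}): {result}")
            return result
        except ValueError as err:
            history.append(f"Tool error: {err}. Please provide a valid query.")
    raise RuntimeError("Tool failed after all retries")

# Scripted "LLM": first proposes an empty query, then corrects itself.
proposals = iter([{"query": ""}, {"query": "Paris weather"}])
history = []
run_tool_with_retry(history, lambda h: next(proposals))
print("\n".join(history))
```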
Agent vs Pure LLM
| Feature | Pure LLM | AI Agent |
|---|---|---|
| Knowledge | Training data only | Can access real-time data |
| Actions | Generate text only | Execute code, API calls, file operations |
| Memory | Context window only | Persistent storage, retrieval |
| Accuracy | May hallucinate | Can verify with tools |
| Autonomy | Single response | Multi-step task completion |
| Cost | One API call | Multiple API calls (LLM + tools) |
Real-World Agent Example
Task: "Analyze the sales data from last month and create a report"
Step 1: LLM plans
Thought: "I need to read the sales data file"
Action: read_file("sales_2024_12.csv")
Step 2: LLM analyzes
Observation: [CSV data received]
Thought: "I should calculate total sales and trends"
Action: execute_code("import pandas as pd; df = pd.read_csv(...)")
Step 3: LLM generates insights
Observation: [Analysis results]
Thought: "Now I'll create a formatted report"
Action: write_file("sales_report.md", content)
Step 4: LLM confirms
Observation: [File created successfully]
Response: "I've analyzed the sales data and created a report..."
Best Practices for Agent Design
1. Clear tool descriptions
- LLM needs to understand when to use each tool
- Include examples in tool documentation

2. Limit tool complexity
- Simple, focused tools work better
- Break complex operations into smaller tools

3. Provide feedback loops
- Always return tool results to the LLM
- Let LLM verify and adjust

4. Set iteration limits
- Prevent infinite loops
- Typical limit: 5-10 iterations

5. Use structured outputs
- JSON or XML for tool calls
- Easier to parse reliably
Agent Limitations
Cost:
- Multiple LLM calls per task
- Can be expensive for complex workflows

Latency:
- Each tool call adds delay
- Multi-step tasks take longer

Reliability:
- More complex = more failure points
- LLM might choose wrong tools

Unpredictability:
- Agent behavior can vary
- Same task might use different approaches
Summary
The LLM pipeline:
Text Prompt
↓
Tokenization (text → token IDs)
↓
Embedding (IDs → vectors)
↓
Transformer Layers (reasoning & pattern matching)
↓
Output Projection (vectors → probabilities)
↓
Decoding (probabilities → tokens)
↓
Detokenization (tokens → text)
↓
Generated Answer
Every stage before decoding is deterministic given the same inputs and parameters; the sampling strategies applied at decoding introduce controlled randomness to create diverse, natural responses.
Agents extend this pipeline by wrapping the LLM in an orchestration layer that enables tool use, memory, and multi-step reasoning, transforming a text generator into an autonomous problem-solver.