How Large Language Models Work
Table of Contents
- Overview
- The Four-Stage Pipeline
- Complete Flow Example
- Key Parameters That Affect Output
- Model Architecture Components
- Training vs Inference
- Limitations and Considerations
- Optimization Techniques
- How AI Agents Use LLMs
- Basic Agent Architecture
- The Agent Loop (ReAct Pattern)
- Key Components of an Agent
- How Agents Extend LLM Capabilities
- Agent Execution Flow
- Types of Agent Architectures
- Tool Calling Formats
- Agent Memory Systems
- Error Handling and Retries
- Agent vs Pure LLM
- Real-World Agent Example
- Best Practices for Agent Design
- Agent Limitations
- Summary
Overview
This guide explains the internal process of how a Large Language Model (LLM) transforms your text prompt into a coherent answer.
The Four-Stage Pipeline
1. Prompt Text → Tokenization
What happens:
- Your input text is broken down into smaller units called "tokens"
- Tokens can be words, subwords, or even individual characters
- Each token is converted to a numerical ID from the model's vocabulary
Example:
Input: "How do neural networks learn?"
Tokens: ["How", " do", " neural", " networks", " learn", "?"]
Token IDs: [2437, 466, 17019, 7686, 2193, 30]
Key concepts:
- Vocabulary size: typically 32k-100k+ tokens
- Special tokens: <start>, <end>, <pad> for structure
- Subword tokenization: handles rare words by breaking them into parts
- Token limit: models have maximum context windows (e.g., 4k, 8k, 128k tokens)
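To see this in practice, here is a quick sketch using OpenAI's open-source tiktoken tokenizer (one tokenizer among many; token boundaries and IDs vary by model, and will differ from the illustrative values above):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one widely used byte-pair-encoding vocabulary;
# every model family ships its own.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("How do neural networks learn?")
print(ids)                             # integer token IDs
print([enc.decode([i]) for i in ids])  # the text piece behind each ID
print(enc.decode(ids))                 # round-trips to the original string
```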
2. Token Processing → Embeddings
What happens:
- Each token ID is converted into a high-dimensional vector (embedding)
- Embeddings capture semantic meaning in numerical space
- Position encodings are added to preserve word order
Example:
Token "neural" → [0.23, -0.45, 0.67, ..., 0.12] (768 dimensions)
Token "network" → [0.19, -0.41, 0.71, ..., 0.09] (768 dimensions)
Key concepts:
- Embedding dimension: typically 768-12,288 dimensions
- Similar words have similar vector representations
- Position matters: "dog bites man" ≠ "man bites dog"
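A toy NumPy sketch of this stage, with randomly initialized weights standing in for the learned embedding table (the sinusoidal position encoding follows the original Transformer recipe; many modern models learn positions or use rotary embeddings instead):

```python
import numpy as np

vocab_size, d_model, seq_len = 50_000, 768, 6
rng = np.random.default_rng(0)

# Toy embedding table; in a real model these weights are learned.
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([2437, 466, 17019, 7686, 2193, 30])
token_vecs = embedding_table[token_ids]           # (6, 768) lookup

# Sinusoidal position encoding: sin on even dims, cos on odd dims.
pos = np.arange(seq_len)[:, None]
dim = np.arange(d_model)[None, :]
angle = pos / np.power(10_000, (2 * (dim // 2)) / d_model)
pos_enc = np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

x = token_vecs + pos_enc                          # input to the first layer
print(x.shape)                                    # (6, 768)
```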
3. Reasoning → Transformer Layers
What happens:
- Embeddings flow through multiple transformer layers (12-96+ layers)
- Each layer performs self-attention and feed-forward operations
- The model identifies patterns, relationships, and context
Self-Attention Mechanism:
For each token:
1. Look at all other tokens in the context
2. Calculate relevance scores (attention weights)
3. Combine information from relevant tokens
4. Update the token's representation
Example attention pattern:
Input: "The cat sat on the mat because it was comfortable"
Token "it" attends strongly to → "mat" (or "cat")
Layer-by-layer processing:
- Early layers: syntax, grammar, basic patterns
- Middle layers: semantic relationships, entity recognition
- Late layers: abstract reasoning, task-specific logic
Key concepts:
- Multi-head attention: parallel attention mechanisms (8-96 heads)
- Residual connections: preserve information across layers
- Layer normalization: stabilize training and inference
- Feed-forward networks: non-linear transformations
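The attention computation itself is compact. Here is a hedged NumPy sketch of a single head, omitting the causal mask, multi-head splitting, residual connections, and normalization that a real decoder layer adds:

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a token sequence x."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # mix info from relevant tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 64))                       # 10 tokens, 64-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(64, 64)) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)       # (10, 64)
```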
4. Text Answer Building → Decoding
What happens:
- The final layer outputs a probability distribution over the vocabulary
- The model selects the next token according to a sampling strategy
- The process repeats autoregressively until completion
Decoding strategies:
Greedy decoding:
Always pick the highest probability token
→ Deterministic but sometimes repetitive
Temperature sampling:
temperature = 0.0 → deterministic (always most likely)
temperature = 0.7 → balanced creativity
temperature = 1.5 → very creative/random
Top-k sampling:
Consider only the k most likely tokens (e.g., k=40)
Sample from this restricted set
Top-p (nucleus) sampling:
Consider tokens until cumulative probability reaches p (e.g., p=0.9)
More dynamic than top-k
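All four strategies reduce to a few lines over the model's output distribution. A hedged NumPy sketch with toy logits (a real model would supply logits over its full vocabulary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Pick the next token ID from raw logits using the strategies above."""
    if temperature == 0.0:                        # greedy decoding
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    if top_k is not None:                         # keep only the k most likely
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                         # smallest set with mass >= p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        kept = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[kept] = 1.0
        probs *= mask
    probs /= probs.sum()                          # renormalize, then sample
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.5, 0.3, -1.0])          # toy 4-token vocabulary
print(sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9))
```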
Example generation:
Prompt: "The capital of France is"
Step 1: Model outputs → " Paris" (95% probability)
Step 2: Model outputs → "," (60% probability)
Step 3: Model outputs → " which" (45% probability)
...continues until <end> token or max length
Complete Flow Example
Input Prompt:
"Explain photosynthesis in simple terms"
Step-by-step process:
1. Tokenization:
["Explain", " photo", "synthesis", " in", " simple", " terms"]
→ [8849, 5052, 48935, 287, 2829, 2846]
2. Embedding:
Each token → 768-dimensional vector
+ positional encoding (token 0, 1, 2, ...)
3. Transformer processing (simplified):
Layer 1: Recognizes "Explain" is a request
Layer 5: Understands "photosynthesis" is a biological process
Layer 10: Connects "simple terms" → need for accessible explanation
Layer 15: Activates knowledge about plants, sunlight, energy
Layer 20: Formulates explanation structure
4. Generation:
Token 1: "Photo" (start of answer)
Token 2: "synthesis"
Token 3: " is"
Token 4: " the"
Token 5: " process"
...
(continues until complete answer)
Key Parameters That Affect Output
Temperature (0.0 - 2.0)
- Controls randomness in token selection
- Lower = more focused and deterministic
- Higher = more creative and diverse
Top-p / Top-k
- Limits the token selection pool
- Prevents very unlikely tokens from being chosen
- Balances coherence and creativity
Max tokens
- Maximum length of generated response
- Prevents infinite generation
- Typical values: 256, 512, 2048, 4096
Frequency penalty
- Reduces repetition of tokens
- Positive values discourage repeated words
- Range: -2.0 to 2.0
Presence penalty
- Encourages topic diversity
- Positive values encourage new topics
- Range: -2.0 to 2.0
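These names follow the convention popularized by the OpenAI API; other providers expose the same knobs under slightly different names. A typical request body looks like this (the model name is illustrative):

```python
import json

request_body = {
    "model": "example-model",           # illustrative, not a real model ID
    "messages": [{"role": "user", "content": "Explain photosynthesis simply."}],
    "temperature": 0.7,                 # randomness in token selection
    "top_p": 0.9,                       # nucleus sampling threshold
    "max_tokens": 512,                  # cap on generated length
    "frequency_penalty": 0.5,           # discourage repeated tokens
    "presence_penalty": 0.3,            # encourage new topics
}
print(json.dumps(request_body, indent=2))
```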
Model Architecture Components
Core elements:
- Token embeddings: Convert IDs to vectors
- Position embeddings: Encode sequence order
- Attention layers: Identify relationships between tokens
- Feed-forward layers: Transform representations
- Layer normalization: Stabilize activations
- Output projection: Convert to vocabulary probabilities
Model sizes:
- Small: 125M-1B parameters (fast, less capable)
- Medium: 7B-13B parameters (balanced)
- Large: 30B-70B parameters (very capable)
- Extra large: 175B-1T+ parameters (most capable, slower)
Training vs Inference
Training (how models learn):
- Process billions of text examples
- Predict next token, compare to actual
- Adjust weights to minimize prediction error
- Takes weeks/months on massive GPU clusters
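The heart of that loop is the next-token prediction objective: cross-entropy between the model's predicted distribution and the token that actually came next. A toy NumPy illustration for a single position (random logits stand in for a real model's output):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 10

# Toy predicted distribution over the vocabulary for one position.
logits = rng.normal(size=vocab_size)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

actual_next_token = 3                      # the token that really followed
loss = -np.log(probs[actual_next_token])   # cross-entropy at this position
print(f"loss = {loss:.3f}")                # training adjusts weights to lower this
```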
Inference (how models respond):
- Use frozen (fixed) weights
- Process your prompt through the network
- Generate tokens one at a time
- Takes seconds to minutes depending on length
Limitations and Considerations
Context window:
- Models can only "see" a limited number of tokens
- Older information may be forgotten in long conversations
- Context window sizes: 4k, 8k, 32k, 128k+ tokens
Knowledge cutoff:
- Models only know information from their training data
- No real-time information unless connected to external tools
- May have outdated information
Hallucinations:
- Models can generate plausible but incorrect information
- Confidence doesn't equal accuracy
- Always verify critical information
Reasoning limitations:
- Pattern matching, not true understanding
- Can struggle with complex logic or math
- May miss subtle context or nuance
Optimization Techniques
Quantization:
- Reduce precision of weights (32-bit → 8-bit or 4-bit)
- Smaller memory footprint, faster inference
- Slight quality trade-off
Caching:
- Store computed values for repeated prompts
- Speeds up multi-turn conversations
- Key-value cache for attention mechanism
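The key-value cache works because, during autoregressive generation, the K and V projections of earlier tokens never change; only the newest token's query must attend over them. A minimal NumPy sketch of the idea:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
K_cache, V_cache = [], []               # grows by one entry per generated token

def attend_new_token(x_new, W_q, W_k, W_v):
    """Attention output for the newest token only, reusing cached K/V."""
    K_cache.append(x_new @ W_k)         # cache instead of recomputing history
    V_cache.append(x_new @ W_v)
    q = x_new @ W_q
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
for _ in range(5):                      # five decoding steps
    out = attend_new_token(rng.normal(size=d), W_q, W_k, W_v)
print(len(K_cache), out.shape)          # 5 cached entries, (64,) output
```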
Batching:
- Process multiple requests simultaneously
- Better GPU utilization
- Higher throughput
How AI Agents Use LLMs
An AI agent is a system that uses an LLM as its "brain" but extends it with additional capabilities like tool use, memory, and planning. Here's how agents work:
Basic Agent Architecture
User Request
↓
Agent System (orchestration layer)
↓
┌─────────────────────────────────────┐
│ LLM (reasoning engine) │
│ - Understands request │
│ - Plans actions │
│ - Decides what tools to use │
└─────────────────────────────────────┘
↓
Tool Execution (external actions)
↓
Results fed back to LLM
↓
Final Response to User
The Agent Loop (ReAct Pattern)
Agents typically follow a Thought → Action → Observation cycle:
Example: "What's the weather in Paris and convert the temperature to Celsius?"
Iteration 1:
Thought: "I need to get the current weather in Paris"
Action: call_tool("get_weather", {"city": "Paris"})
Observation: "Temperature: 72°F, Sunny"
Iteration 2:
Thought: "I have the temperature in Fahrenheit, need to convert to Celsius"
Action: call_tool("convert_temperature", {"value": 72, "from": "F", "to": "C"})
Observation: "22.2°C"
Iteration 3:
Thought: "I have all the information needed"
Action: respond_to_user
Response: "The weather in Paris is sunny with a temperature of 22.2°C (72°F)."
Key Components of an Agent
1. System Prompt (Instructions)
You are an AI assistant with access to tools.
When you need information, use the available tools.
Always explain your reasoning before taking action.
Available tools:
- search_web(query): Search the internet
- read_file(path): Read a file
- execute_code(code): Run Python code
2. Tool Definitions
{
  "name": "search_web",
  "description": "Search the internet for current information",
  "parameters": {
    "query": "string - the search query"
  }
}
3. Conversation Memory
[Previous messages]
User: "Find the population of Tokyo"
Assistant: [used search_web] "Tokyo has 14 million people"
User: "What about Paris?"
Assistant: [remembers context] [uses search_web] "Paris has 2.1 million people"
How Agents Extend LLM Capabilities
| Limitation | How Agents Solve It |
|---|---|
| No real-time data | Connect to APIs, databases, search engines |
| Can't perform actions | Execute code, modify files, send emails |
| Limited memory | Store conversation history, use vector databases |
| No access to private data | Read from user's files, databases, documents |
| Can't verify facts | Use tools to check information, run calculations |
Agent Execution Flow
Step 1: Prompt Construction
System Instructions
+
Tool Definitions
+
Conversation History
+
User Request
→ Sent to LLM
Step 2: LLM Response Parsing
LLM Output: "I need to search for information.
<tool_call>search_web("Paris weather")</tool_call>"
Agent parses this and extracts:
- Tool name: search_web
- Parameters: {"query": "Paris weather"}
Step 3: Tool Execution
Agent executes: search_web("Paris weather")
Result: "Current weather in Paris: 22°C, Sunny"
Step 4: Result Injection
Agent adds result to context:
"Tool result: Current weather in Paris: 22°C, Sunny"
→ Sends back to LLM for next decision
Step 5: Iteration or Completion
LLM decides:
- Need more tools? → Repeat cycle
- Have enough info? → Generate final response
Types of Agent Architectures
1. ReAct (Reasoning + Acting)
- LLM reasons about what to do
- Executes actions via tools
- Observes results and continues

2. Plan-and-Execute
- LLM creates a complete plan first
- Agent executes all steps
- Less flexible but more predictable

3. Autonomous Agents
- Given high-level goals
- Continuously run until goal achieved
- Can spawn sub-tasks

4. Multi-Agent Systems
- Multiple specialized agents
- Each has different tools/expertise
- Collaborate to solve complex tasks
Tool Calling Formats
Function Calling (Structured)
{
  "tool": "get_weather",
  "arguments": {
    "city": "Paris",
    "units": "celsius"
  }
}
Natural Language (Parsed)
I'll use the weather tool to check Paris.
ACTION: get_weather(city="Paris", units="celsius")
XML Format
<tool_call>
  <name>get_weather</name>
  <parameters>
    <city>Paris</city>
    <units>celsius</units>
  </parameters>
</tool_call>
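Whichever format is used, the agent needs a reliable parser. Here is a sketch for the XML format above using Python's standard library (a production agent would also validate arguments against the tool's schema):

```python
import xml.etree.ElementTree as ET

llm_output = """<tool_call>
  <name>get_weather</name>
  <parameters>
    <city>Paris</city>
    <units>celsius</units>
  </parameters>
</tool_call>"""

root = ET.fromstring(llm_output)
tool_name = root.findtext("name")
params = {child.tag: child.text for child in root.find("parameters")}
print(tool_name, params)   # get_weather {'city': 'Paris', 'units': 'celsius'}
```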
Agent Memory Systems
Short-term Memory:
- Current conversation context
- Recent tool results
- Stored in prompt/context window

Long-term Memory:
- Vector database for semantic search
- Key-value stores for facts
- Retrieved when relevant
Example:
User: "Remember my favorite color is blue"
→ Agent stores: {"user_preference": "favorite_color", "value": "blue"}
Later...
User: "What color should I paint my room?"
→ Agent retrieves: "favorite_color = blue"
→ Response: "Since your favorite color is blue, you might consider..."
Error Handling and Retries
Agents handle failures that LLMs alone cannot:
Attempt 1: call_tool("search", {"query": ""})
Error: "Query cannot be empty"
Agent injects error into context:
"Tool error: Query cannot be empty. Please provide a valid query."
LLM adjusts:
Attempt 2: call_tool("search", {"query": "Paris weather"})
Success: Returns weather data
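A sketch of that retry pattern: the agent catches the tool error, feeds it back into context as an observation, and asks again (the search tool and the "LLM" proposing arguments are both stubs):

```python
def search(query):
    """Stub tool that rejects empty queries, mirroring the example above."""
    if not query:
        raise ValueError("Query cannot be empty")
    return f"Results for {query!r}"

def run_tool_with_retry(history, propose_args, max_retries=3):
    """Execute the tool, injecting errors back into context on failure."""
    for attempt in range(1, max_retries + 1):
        args = propose_args(history)      # stand-in for an LLM call
        try:
            result = search(**args)
            history.append(f"Success (attempt {attempt}): {result}")
            return result
        except ValueError as err:
            history.append(f"Tool error: {err}. Please provide a valid query.")
    raise RuntimeError("Tool failed after all retries")

# Scripted "LLM": first proposes an empty query, then corrects itself.
proposals = iter([{"query": ""}, {"query": "Paris weather"}])
history = []
run_tool_with_retry(history, lambda h: next(proposals))
print("\n".join(history))
```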
Agent vs Pure LLM
| Feature | Pure LLM | AI Agent |
|---|---|---|
| Knowledge | Training data only | Can access real-time data |
| Actions | Generate text only | Execute code, API calls, file operations |
| Memory | Context window only | Persistent storage, retrieval |
| Accuracy | May hallucinate | Can verify with tools |
| Autonomy | Single response | Multi-step task completion |
| Cost | One API call | Multiple API calls (LLM + tools) |
Real-World Agent Example
Task: "Analyze the sales data from last month and create a report"
Step 1: LLM plans
Thought: "I need to read the sales data file"
Action: read_file("sales_2024_12.csv")
Step 2: LLM analyzes
Observation: [CSV data received]
Thought: "I should calculate total sales and trends"
Action: execute_code("import pandas as pd; df = pd.read_csv(...)")
Step 3: LLM generates insights
Observation: [Analysis results]
Thought: "Now I'll create a formatted report"
Action: write_file("sales_report.md", content)
Step 4: LLM confirms
Observation: [File created successfully]
Response: "I've analyzed the sales data and created a report..."
Best Practices for Agent Design
1. Clear tool descriptions
- LLM needs to understand when to use each tool
- Include examples in tool documentation

2. Limit tool complexity
- Simple, focused tools work better
- Break complex operations into smaller tools

3. Provide feedback loops
- Always return tool results to the LLM
- Let LLM verify and adjust

4. Set iteration limits
- Prevent infinite loops
- Typical limit: 5-10 iterations

5. Use structured outputs
- JSON or XML for tool calls
- Easier to parse reliably
Agent Limitations
Cost:
- Multiple LLM calls per task
- Can be expensive for complex workflows

Latency:
- Each tool call adds delay
- Multi-step tasks take longer

Reliability:
- More complex = more failure points
- LLM might choose wrong tools

Unpredictability:
- Agent behavior can vary
- Same task might use different approaches
Summary
The LLM pipeline:
Text Prompt
↓
Tokenization (text → token IDs)
↓
Embedding (IDs → vectors)
↓
Transformer Layers (reasoning & pattern matching)
↓
Output Projection (vectors → probabilities)
↓
Decoding (probabilities → tokens)
↓
Detokenization (tokens → text)
↓
Generated Answer
Every stage before decoding is deterministic given the same inputs and parameters; the sampling strategies applied at decoding introduce controlled randomness to create diverse, natural responses.
Agents extend this pipeline by wrapping the LLM in an orchestration layer that enables tool use, memory, and multi-step reasoning, transforming a text generator into an autonomous problem-solver.