An interactive, hands-on guide from fundamentals to deep RL — with live demos you can run in your browser.
What you will learn
How agents learn by trial and error, the math behind value functions and Q-learning, exploration strategies, policy gradient methods, and where modern deep RL is headed.
🤖
What is RL?
Agents, environments, states, actions, rewards. The core feedback loop.
Theory
🔗
Markov Decision Processes
The mathematical framework that formalizes sequential decision making.
Theory
📈
Value Functions & Bellman
How to estimate long-term returns. Interactive value iteration demo.
Demo
🤖
Q-Learning
Watch a real agent learn to navigate a grid world from scratch.
Live Demo
🎲
Exploration vs Exploitation
The multi-armed bandit problem. Pull the levers, see estimates converge.
Demo
📊
Policy Gradient Methods
Optimize the policy directly. REINFORCE, actor-critic, and intuition.
Theory
Chapter 1
What is Reinforcement Learning?
Learning by interacting with the world — no labels required.
In Reinforcement Learning (RL), an agent learns to make decisions by interacting with an environment. Unlike supervised learning, there are no labeled examples — only rewards and punishments that guide behavior over time.
Think of training a dog: you don't explain what "sit" means in words. You reward good behavior and ignore bad behavior, and the dog gradually learns. RL works the same way — at massive scale and speed.
The Agent-Environment Loop
Click the labels below to explore each concept.
Agent
Action aₜ →
↓ ↑
State sₜ₊₁, Reward rₜ₊₁ ←
Environment
s
State
A snapshot of the environment at a given time. Could be a game frame, a robot's joint angles, or a portfolio value. The agent bases its decisions on the current state.
a
Action
A choice the agent makes. Could be "move left", "buy shares", or "apply force". The set of all possible actions is called the action space.
r
Reward
A scalar signal the agent receives after each action. Positive rewards encourage behavior; negative rewards discourage it. The agent's goal is to maximize total reward over time.
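In code, the agent-environment loop is just a few lines. Here is a minimal sketch in Python (the `CoinFlipEnv` environment and its reward scheme are invented for illustration):

```python
import random

random.seed(0)

class CoinFlipEnv:
    """Toy environment: guess a coin flip, +1 reward for a correct guess."""
    def reset(self):
        return 0  # a single dummy state

    def step(self, action):
        outcome = random.choice([0, 1])
        reward = 1.0 if action == outcome else 0.0
        return 0, reward, True  # next state, reward, done (one-step episodes)

env = CoinFlipEnv()
total = 0.0
for episode in range(100):
    state = env.reset()
    done = False
    while not done:
        action = random.choice([0, 1])            # agent picks an action
        state, reward, done = env.step(action)    # environment responds
        total += reward

print("average reward:", total / 100)             # hovers around 0.5
```

A random agent averages about 0.5 here; a learning agent would use the reward signal to do better (when the environment actually has structure to exploit).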
Key Concepts
🎯
Policy π
The agent's strategy — a mapping from states to actions. A deterministic policy says "in state s, do action a". A stochastic policy gives a probability distribution over actions.
📈
Return G
The total accumulated reward from time t onwards: Gₜ = rₜ₊₁ + γrₜ₊₂ + γ²rₜ₊₃ + ... where γ (gamma) is the discount factor, making future rewards worth less.
🏠
Episode
A complete sequence from start to terminal state. In a game, one episode = one game. The agent resets and tries again, hopefully doing better each time.
📄
Discount Factor γ
A value in [0, 1] that controls how much the agent cares about future rewards. γ = 0: only immediate reward matters. γ = 0.99: agent is almost fully far-sighted.
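The return and discount factor are easy to compute directly. A short sketch (the reward sequence is made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
    Computed right-to-left so each reward is multiplied by gamma once."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0, 0, 0, 10]                          # reward arrives only at the end
print(discounted_return(rewards, gamma=1.0))     # 10.0 (no discounting)
print(discounted_return(rewards, gamma=0.9))     # about 7.29 = 0.9^3 * 10
print(discounted_return(rewards, gamma=0.0))     # 0.0 (only the immediate reward)
```

Notice how γ = 0 makes the delayed +10 invisible: a myopic agent would never work toward it.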
Real-World Examples
🎮 Games DeepMind's Atari agents and AlphaGo, OpenAI's Dota 2 bot — agents trained by playing millions of games.
🤖 Robotics Robot arms learning to grasp objects, legged robots learning to walk — purely through trial and error.
💡 Recommendations YouTube, Netflix — the recommendation engine is trained to maximize engagement (reward) over sessions.
Chapter 2
Markov Decision Processes
The mathematical language of sequential decision making.
A Markov Decision Process (MDP) is the formal framework underlying most RL problems. It gives us precise language to describe the agent, environment, and their interaction.
Definition
An MDP is a 5-tuple (S, A, P, R, γ):
🌏
S — State Space
The set of all possible situations the agent can be in. Can be finite (grid world) or continuous (robot joint angles).
▶
A — Action Space
All possible actions available to the agent. Can be discrete (left/right/up/down) or continuous (torque values).
🔄
P — Transition Model
P(s'|s, a) — probability of landing in state s' after taking action a in state s. Captures environment dynamics.
🎉
R — Reward Function
R(s, a) — the immediate reward received after taking action a in state s. Encodes what the agent should optimize.
The Markov Property
Key Insight
The Markov property states that the future depends only on the current state, not on the history of how we got there. This is a crucial simplifying assumption that makes RL tractable.
The next state depends only on the current state and action — not the full history. This is why we can build compact algorithms.
A Simple MDP Example
Consider a 3-room apartment. The agent (robot) moves between rooms to reach the kitchen (goal).
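That example can be written down as plain data. A minimal sketch, assuming a hall ↔ bedroom ↔ kitchen layout with +10 for reaching the kitchen (the room names, layout, and reward values are assumptions for illustration, not taken from the demo):

```python
# A tiny MDP as plain data: hall <-> bedroom <-> kitchen, kitchen is the goal.
states  = ["hall", "bedroom", "kitchen"]
actions = ["left", "right"]
gamma   = 0.9

# P[s][a] -> next state (deterministic here, so probabilities are implicitly 1.0)
P = {
    "hall":    {"left": "hall", "right": "bedroom"},
    "bedroom": {"left": "hall", "right": "kitchen"},
    "kitchen": {},  # terminal state: no transitions out
}

# R[s][a] -> immediate reward; entering the kitchen pays +10
R = {
    "hall":    {"left": 0.0, "right": 0.0},
    "bedroom": {"left": 0.0, "right": 10.0},
}

s, total = "hall", 0.0
for a in ["right", "right"]:     # a hand-written policy: always go right
    total += R[s][a]
    s = P[s][a]
print(s, total)                  # kitchen 10.0
```

Everything an RL algorithm needs is in those two dictionaries plus γ; the algorithms in the next chapters only differ in whether they get to read P and R directly or must discover them by acting.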
Why MDP Formulation Matters
Once we have an MDP, we can apply dynamic programming, Monte Carlo, or temporal difference methods to find the optimal policy — the mapping from states to actions that maximizes total reward.
MDP vs Real World
Real problems rarely have a known transition model P(s'|s,a). When we know the model, we can use planning (Dynamic Programming). When we don't, we use model-free RL like Q-learning — the agent learns from experience without needing to know how the world works.
Chapter 3
Value Functions & Bellman Equations
How do we quantify how good a state is? Enter value functions.
The State-Value Function V(s)
The value of a state is the expected total return starting from that state, following policy π:
Vπ(s) = Eπ[ Gₜ | sₜ = s ] = Eπ[ ∑ₖ₌₀^∞ γᵏ rₜ₊ₖ₊₁ | sₜ = s ]
Intuitively: if the agent is in state s and plays policy π from now on, how much total reward can it expect?
The Action-Value Function Q(s, a)
The Q-function (quality function) tells us how good it is to take action a in state s:
Qπ(s, a) = Eπ[ Gₜ | sₜ = s, aₜ = a ]
Q is more useful for learning because the optimal policy follows trivially: π*(s) = argmaxₐ Q*(s, a)
The Bellman Equation
The key insight of RL: a state's value can be expressed recursively in terms of the next state's value. For a policy π:
Vπ(s) = ∑ₐ π(a|s) ∑ₛ′ P(s′|s, a) [ R(s, a) + γ Vπ(s′) ]
This recursive structure allows us to compute values by bootstrapping: use current estimates to update estimates. This is the core of Q-learning and TD learning.
Interactive: Value Iteration
A 4x4 grid world. Click cells to cycle reward types. Then press Iterate to run one Bellman sweep. Watch values propagate from the goal backwards.
γ = 0.9
Sweeps: 0 · Max |ΔV|: —
■ Goal (+10) ■ Trap (-5) ■ Wall (blocked) ■ Empty (0). Click cells to change type.
Chapter 4
Q-Learning: A Live Demo
Watch an agent learn to navigate a grid world from scratch, with no knowledge of the environment.
The Q-Learning Algorithm
Q-learning is a model-free, off-policy TD algorithm. It directly learns the optimal Q-function without needing to know the environment dynamics.
Q(s, a) ← Q(s, a) + α [ r + γ maxₐ′ Q(s′, a′) − Q(s, a) ]
// Q-Learning Algorithm
Initialize Q(s, a) = 0 for all s, a
for each episode:
    s ← start state
    while s is not terminal:
        a ← ε-greedy(Q, s)                       // explore or exploit
        r, s' ← env.step(s, a)                   // take action, observe
        Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',·) - Q(s,a)]
        s ← s'
    decay ε                                      // less exploration over time
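The pseudocode above translates directly into runnable Python. A minimal sketch on a 1-D corridor instead of the full grid (the corridor, reward, and hyperparameters are stand-ins chosen for illustration):

```python
import random

random.seed(0)

# States 0..4 in a corridor; start at 0, reaching state 4 (the goal) pays +10.
N_STATES, GOAL = 5, 4
alpha, gamma, eps = 0.5, 0.9, 1.0
Q = {(s, a): 0.0 for s in range(N_STATES) for a in (-1, 1)}

for episode in range(200):
    s = 0
    while s != GOAL:
        if random.random() < eps:                         # explore
            a = random.choice([-1, 1])
        else:                                             # exploit
            a = max((-1, 1), key=lambda x: Q[(s, x)])
        s2 = min(max(s + a, 0), N_STATES - 1)             # walls at both ends
        r = 10.0 if s2 == GOAL else 0.0
        best_next = 0.0 if s2 == GOAL else max(Q[(s2, -1)], Q[(s2, 1)])
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # TD update
        s = s2
    eps = max(0.05, eps * 0.99)                           # decay exploration

policy = [max((-1, 1), key=lambda x: Q[(s, x)]) for s in range(GOAL)]
print(policy)  # [1, 1, 1, 1]: always move right, toward the goal
```

Note the agent never sees the transition rules: it only observes (s, a, r, s') tuples, exactly the model-free setting described above.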
Live Grid World Demo
The agent (white circle) starts top-left, goal is bottom-right (gold). Walls block movement. Cell color = max Q-value (darker = less known/good, brighter = more promising).
The cell colors lighten as Q-values grow near the goal
Policy arrows eventually point toward the goal from everywhere
As ε decays, the agent stops exploring and follows the learned policy
The agent finds optimal (or near-optimal) paths through the maze
Chapter 5
Exploration vs Exploitation
The fundamental dilemma: should the agent try new things or stick with what works?
Exploitation means using current knowledge to pick the best-known action. Exploration means trying less-known actions to gather information. Too much exploitation: you get stuck at local optima. Too much exploration: you never use what you know.
The Multi-Armed Bandit Problem
Imagine 5 slot machines (bandits), each with a different but unknown expected payout. You get 50 pulls total. Your goal: maximize total reward.
This is the exploration-exploitation dilemma in its purest form — no states, just action selection.
Interactive Bandit Demo
Click the slot machines to pull them. The true reward is hidden — you must discover which machine is best through exploration!
Pull a slot machine to start...
Total pulls: 0 · Total reward: 0.00 · Best arm: ?
Common Exploration Strategies
🎲
ε-Greedy
With probability ε, pick a random action (explore). Otherwise, pick the best-known action (exploit). Simple and widely used. Decay ε over time as you learn more.
📊
UCB (Upper Confidence Bound)
Choose action with highest Q + c√(ln(t)/N(a)). This naturally explores less-tried actions more, giving "optimism in the face of uncertainty".
🌡
Boltzmann / Softmax
Sample actions proportional to exp(Q(a)/τ). Temperature τ controls randomness. High τ = uniform random, low τ = greedy. Smooth interpolation between the extremes.
👉
Thompson Sampling
Maintain a probability distribution over each arm's true value. Sample from each distribution and pick the highest sample. Bayesian and provably efficient.
UCB: Aₜ = argmaxₐ[ Q(a) + c √(ln t / N(a)) ]
where N(a) is the number of times action a has been tried and t is the total number of steps.
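ε-greedy is short enough to simulate end-to-end. A sketch with five arms whose true means are invented for illustration (arm 3 is secretly the best):

```python
import random

random.seed(1)

true_means = [0.2, 0.5, 0.1, 0.8, 0.4]   # hidden from the agent
Q = [0.0] * 5                             # running value estimate per arm
N = [0] * 5                               # pull counts
eps = 0.1

for t in range(1000):
    if random.random() < eps:
        a = random.randrange(5)                  # explore: random arm
    else:
        a = max(range(5), key=lambda i: Q[i])    # exploit: best estimate
    r = random.gauss(true_means[a], 0.1)         # noisy payout
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                    # incremental sample mean

print("estimates:", [round(q, 2) for q in Q])
print("most pulled arm:", max(range(5), key=lambda i: N[i]))
```

With 10% exploration the agent reliably identifies arm 3; set eps to 0 and it usually locks onto whichever arm it tried first, which is the failure mode of pure exploitation.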
Chapter 6
Policy Gradient Methods
Instead of learning a value function, why not directly optimize the policy?
The Core Idea
Q-learning learns a value function and derives a policy indirectly. Policy gradient methods parameterize the policy directly as πθ(a|s) and optimize parameters θ using gradient ascent on expected return.
J(θ) = Eπθ[ G₀ ] = Eπθ[ ∑ₜ γᵗ rₜ₊₁ ]
θ ← θ + α ∇θ J(θ) // gradient ASCENT to maximize
The Policy Gradient Theorem
Computing the gradient of J directly is hard — the environment dynamics P are unknown. The policy gradient theorem gives us a tractable form:
∇θ J(θ) = Eπθ[ ∇θ log πθ(a|s) · Qπ(s, a) ]
This is elegant: the gradient is the expected product of the log-probability gradient and the Q-value. We can estimate this from experience without knowing P.
REINFORCE Algorithm
The simplest policy gradient method — uses the full return Gₜ as an estimate of Q:
// REINFORCE (Williams, 1992)
Initialize policy πθ (e.g. neural network with params θ)
for each episode:
    Sample trajectory τ = (s₀,a₀,r₁, s₁,a₁,r₂, ..., s_T) from πθ
    for each step t:
        Gₜ ← ∑ₖ₌ₜ^T γᵏ⁻ᵗ rₖ₊₁               // return from t
        θ ← θ + α γᵗ Gₜ ∇θ log πθ(aₜ|sₜ)     // update
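REINFORCE is runnable in a few lines for the degenerate case of one-step episodes: a two-armed bandit with a softmax policy over two logits (the payouts and hyperparameters here are invented for illustration):

```python
import math
import random

random.seed(0)

theta = [0.0, 0.0]        # policy logits, one per arm
alpha = 0.1               # learning rate
means = [0.0, 1.0]        # hidden expected payouts; arm 1 is better

def policy_probs():
    z = [math.exp(t) for t in theta]
    return [z[0] / (z[0] + z[1]), z[1] / (z[0] + z[1])]

for episode in range(2000):
    probs = policy_probs()
    a = 1 if random.random() < probs[1] else 0        # sample from pi_theta
    G = random.gauss(means[a], 0.1)                   # one-step return
    # gradient of log softmax: d/dtheta_i log pi(a) = 1{i == a} - pi(i)
    for i in range(2):
        theta[i] += alpha * G * ((1 if i == a else 0) - probs[i])

print("P(arm 1) =", round(policy_probs()[1], 3))      # climbs toward 1
```

The update makes rewarded actions more probable in proportion to their return, which is the whole idea of the policy gradient theorem in miniature.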
Why Policy Gradients?
✅
Works with Continuous Actions
Q-learning requires a max over actions — impossible in continuous spaces. Policy gradients handle this naturally by parameterizing πθ(a|s) as a Gaussian distribution.
🔄
Stochastic Policies
Policy gradients naturally represent stochastic policies. This is useful when the optimal policy is non-deterministic (e.g., bluffing in poker requires randomness).
🔥
High Variance
REINFORCE has high variance — return estimates are noisy. Solutions: subtract a baseline, use actor-critic methods, or advantage functions A(s,a) = Q(s,a) - V(s).
🧠
Actor-Critic
Combines value functions (critic estimates V(s)) with policy gradients (actor updates πθ). The critic reduces variance; the actor gets direct policy improvement.
Value-Based vs Policy-Based
Q-Learning (Value-Based)
Learn Q(s,a), derive policy
Deterministic policy (greedy)
Discrete action spaces
Often more sample efficient
DQN, Rainbow, C51
Policy Gradients (Policy-Based)
Directly optimize πθ(a|s)
Stochastic policies
Continuous action spaces
Higher variance, but scalable
PPO, A3C, SAC, TRPO
Chapter 7
Deep RL & Beyond
Combining deep neural networks with RL to tackle complex, high-dimensional problems.
The Problem with Tabular RL
Q-learning stores a table of Q(s, a) values. This works for small discrete state spaces, but fails in complex environments. A single Atari observation (an 84×84×4 stack of 256-level grayscale pixels) has 256^(84·84·4) possible states — a table won't fit in the observable universe.
Deep RL replaces the Q-table with a neural network: Q(s, a; θ). The network takes a state (e.g., raw pixels) and outputs Q-values for all actions.
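Concretely, the Q-network is just a function from a state vector to one Q-value per action. A minimal untrained sketch in NumPy (the layer sizes are arbitrary, and the weights are random: real DQN fits them by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, HIDDEN, N_ACTIONS = 4, 32, 2    # sizes chosen for illustration

W1 = rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_values(state):
    h = np.maximum(0.0, state @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                     # one Q-value per action

s = rng.normal(size=STATE_DIM)             # e.g. a CartPole-style state vector
q = q_values(s)
print(q, "-> greedy action:", int(np.argmax(q)))
```

One forward pass replaces a table lookup, and nearby states share weights, so the network generalizes across states a table never visits.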
Key Algorithms Timeline
2013
DQN (DeepMind) — First deep RL system to master Atari games from raw pixels. Key innovations: experience replay, target networks to stabilize training.
2016
A3C (Asynchronous Advantage Actor-Critic) — Multiple agents run in parallel, asynchronously updating shared network weights. Faster and more stable than DQN.
2016
AlphaGo — Defeated world Go champion. Combined supervised learning, policy gradients, and Monte Carlo Tree Search.
2017
PPO (Proximal Policy Optimization) — OpenAI's workhorse algorithm. Clips the policy update to prevent destructively large steps. Simple, robust, widely used today.
2018
SAC (Soft Actor-Critic) — Maximum entropy RL. Maximizes reward AND policy entropy (randomness), leading to better exploration and robustness.
2019+
AlphaStar, OpenAI Five, MuZero — Superhuman performance in StarCraft, Dota 2. MuZero learns its own world model without being told the rules.
2022+
RLHF (RL from Human Feedback) — The technique behind ChatGPT/Claude. Train a reward model from human preferences, then use PPO to fine-tune the language model.
DQN: The Core Innovation
// Deep Q-Network (DQN) — simplified
Initialize Q-network Q(s,a;θ) and target net Q(s,a;θ⁻)
Initialize replay buffer D
for each step:
    a = ε-greedy(Q(s,·;θ))
    r, s' = env.step(a)
    D.push((s, a, r, s'))                    // store experience
    (s,a,r,s') = D.sample_batch()            // replay for stability
    y = r + γ maxₐ′ Q(s',a';θ⁻)              // target (fixed params)
    θ ← θ - α ∇θ (y - Q(s,a;θ))²             // gradient descent
    periodically: θ⁻ ← θ                     // update target net
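The replay buffer D from the pseudocode is a tiny data structure on its own. A minimal sketch (the capacity and the fake transitions are placeholders):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store transitions, sample random minibatches
    to break the correlation between consecutive steps."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)   # oldest experience falls off

    def push(self, s, a, r, s2, done):
        self.buf.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)

buf = ReplayBuffer(capacity=10_000)
for t in range(100):                        # fake transitions for the demo
    buf.push(s=t, a=0, r=0.0, s2=t + 1, done=False)
batch = buf.sample(32)
print(len(buf), len(batch))                 # 100 32
```

Sampling uniformly from old experience is one of the two DQN stabilizers; the other is the target network θ⁻ that keeps the regression target fixed between syncs.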
Where to Go Next
📚
Sutton & Barto
"Reinforcement Learning: An Introduction" — the canonical textbook, freely available online. Complete mathematical treatment of everything covered here.
🏭
Gymnasium (OpenAI Gym)
The standard Python library for RL environments. CartPole, MountainCar, Atari, MuJoCo. Start here to run your own Q-learning and PPO experiments.
💻
Stable-Baselines3
High-quality PyTorch implementations of PPO, SAC, DQN, A2C. The fastest way to train RL agents without writing algorithms from scratch.
🚀
DeepMind / OpenAI Papers
Read the original DQN, PPO, AlphaGo papers on arXiv. The field moves fast — following new papers from top labs is how to stay current.
You have completed RL Tutor!
You now understand the full stack: from MDPs and value functions, through Q-learning and policy gradients, to modern deep RL. The best way to solidify this knowledge is to implement: start with tabular Q-learning on CartPole, then try DQN on Atari.