An interactive, hands-on guide from fundamentals to deep RL — with live demos you can run in your browser.
What you will learn
How agents learn by trial and error, the math behind value functions and Q-learning, exploration strategies, policy gradient methods, and where modern deep RL is headed.
🤖
What is RL?
Agents, environments, states, actions, rewards. The core feedback loop.
Theory
🔗
Markov Decision Processes
The mathematical framework that formalizes sequential decision making.
Theory
📈
Value Functions & Bellman
How to estimate long-term returns. Interactive value iteration demo.
Demo
🤖
Q-Learning
Watch a real agent learn to navigate a grid world from scratch.
Live Demo
🎲
Exploration vs Exploitation
The multi-armed bandit problem. Pull the levers, see estimates converge.
Demo
📊
Policy Gradient Methods
Optimize the policy directly. REINFORCE, actor-critic, and intuition.
Theory
Chapter 1
What is Reinforcement Learning?
Learning by interacting with the world — no labels required.
In Reinforcement Learning (RL), an agent learns to make decisions by interacting with an environment. Unlike supervised learning, there are no labeled examples — only rewards and punishments that guide behavior over time.
Think of training a dog: you don't explain what "sit" means in words. You reward good behavior and ignore bad behavior, and the dog gradually learns. RL works the same way — at massive scale and speed.
The Agent-Environment Loop
Click the labels below to explore each concept.
Agent
Action aₜ →
↓ ↑
State sₜ₊₁, Reward rₜ₊₁ ←
Environment
s
State
A snapshot of the environment at a given time. Could be a game frame, a robot's joint angles, or a portfolio value. The agent bases its decisions on the current state.
a
Action
A choice the agent makes. Could be "move left", "buy shares", or "apply force". The set of all possible actions is called the action space.
r
Reward
A scalar signal the agent receives after each action. Positive rewards encourage behavior; negative rewards discourage it. The agent's goal is to maximize total reward over time.
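In code, the agent-environment loop is just a few lines. Here is a minimal sketch in Python (the `CoinFlipEnv` environment and its reward scheme are invented for illustration):

```python
import random

random.seed(0)

class CoinFlipEnv:
    """Toy environment: guess a coin flip, +1 reward for a correct guess."""
    def reset(self):
        return 0  # a single dummy state

    def step(self, action):
        outcome = random.choice([0, 1])
        reward = 1.0 if action == outcome else 0.0
        return 0, reward, True  # next state, reward, done (one-step episodes)

env = CoinFlipEnv()
total = 0.0
for episode in range(100):
    state = env.reset()
    done = False
    while not done:
        action = random.choice([0, 1])            # agent picks an action
        state, reward, done = env.step(action)    # environment responds
        total += reward

print("average reward:", total / 100)             # hovers around 0.5
```

A random agent averages about 0.5 here; a learning agent would use the reward signal to do better (when the environment actually has structure to exploit).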
Key Concepts
🎯
Policy π
The agent's strategy — a mapping from states to actions. A deterministic policy says "in state s, do action a". A stochastic policy gives a probability distribution over actions.
📈
Return G
The total accumulated reward from time t onwards: Gₜ = rₜ₊₁ + γrₜ₊₂ + γ²rₜ₊₃ + ... where γ (gamma) is the discount factor, making future rewards worth less.
🏠
Episode
A complete sequence from start to terminal state. In a game, one episode = one game. The agent resets and tries again, hopefully doing better each time.
📄
Discount Factor γ
A value in [0, 1] that controls how much the agent cares about future rewards. γ = 0: only immediate reward matters. γ = 0.99: agent is almost fully far-sighted.
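The return and discount factor are easy to compute directly. A short sketch (the reward sequence is made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
    Computed right-to-left so each reward is multiplied by gamma once."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0, 0, 0, 10]                          # reward arrives only at the end
print(discounted_return(rewards, gamma=1.0))     # 10.0 (no discounting)
print(discounted_return(rewards, gamma=0.9))     # about 7.29 = 0.9^3 * 10
print(discounted_return(rewards, gamma=0.0))     # 0.0 (only the immediate reward)
```

Notice how γ = 0 makes the delayed +10 invisible: a myopic agent would never work toward it.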
Real-World Examples
🎮 Games DeepMind's Atari agents and AlphaGo, OpenAI's Dota 2 bot — agents trained by playing millions of games.
🤖 Robotics Robot arms learning to grasp objects, legged robots learning to walk — purely through trial and error.
💡 Recommendations YouTube, Netflix — the recommendation engine is trained to maximize engagement (reward) over sessions.
Chapter 2
Markov Decision Processes
The mathematical language of sequential decision making.
A Markov Decision Process (MDP) is the formal framework underlying most RL problems. It gives us precise language to describe the agent, environment, and their interaction.
Definition
An MDP is a 5-tuple (S, A, P, R, γ):
🌏
S — State Space
The set of all possible situations the agent can be in. Can be finite (grid world) or continuous (robot joint angles).
▶
A — Action Space
All possible actions available to the agent. Can be discrete (left/right/up/down) or continuous (torque values).
🔄
P — Transition Model
P(s'|s, a) — probability of landing in state s' after taking action a in state s. Captures environment dynamics.
🎉
R — Reward Function
R(s, a) — the immediate reward received after taking action a in state s. Encodes what the agent should optimize.
The Markov Property
Key Insight
The Markov property states that the future depends only on the current state, not on the history of how we got there. This is a crucial simplifying assumption that makes RL tractable.
The next state depends only on the current state and action — not the full history. This is why we can build compact algorithms.
A Simple MDP Example
Consider a 3-room apartment. The agent (robot) moves between rooms to reach the kitchen (goal).
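That example can be written down as plain data. A minimal sketch, assuming a hall ↔ bedroom ↔ kitchen layout with +10 for reaching the kitchen (the room names, layout, and reward values are assumptions for illustration, not taken from the demo):

```python
# A tiny MDP as plain data: hall <-> bedroom <-> kitchen, kitchen is the goal.
states  = ["hall", "bedroom", "kitchen"]
actions = ["left", "right"]
gamma   = 0.9

# P[s][a] -> next state (deterministic here, so probabilities are implicitly 1.0)
P = {
    "hall":    {"left": "hall", "right": "bedroom"},
    "bedroom": {"left": "hall", "right": "kitchen"},
    "kitchen": {},  # terminal state: no transitions out
}

# R[s][a] -> immediate reward; entering the kitchen pays +10
R = {
    "hall":    {"left": 0.0, "right": 0.0},
    "bedroom": {"left": 0.0, "right": 10.0},
}

s, total = "hall", 0.0
for a in ["right", "right"]:     # a hand-written policy: always go right
    total += R[s][a]
    s = P[s][a]
print(s, total)                  # kitchen 10.0
```

Everything an RL algorithm needs is in those two dictionaries plus γ; the algorithms in the next chapters only differ in whether they get to read P and R directly or must discover them by acting.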
Why MDP Formulation Matters
Once we have an MDP, we can apply dynamic programming, Monte Carlo, or temporal difference methods to find the optimal policy — the mapping from states to actions that maximizes total reward.
MDP vs Real World
Real problems rarely have a known transition model P(s'|s,a). When we know the model, we can use planning (Dynamic Programming). When we don't, we use model-free RL like Q-learning — the agent learns from experience without needing to know how the world works.
Chapter 3
Value Functions & Bellman Equations
How do we quantify how good a state is? Enter value functions.
The State-Value Function V(s)
The value of a state is the expected total return starting from that state, following policy π:
Vπ(s) = Eπ[ Gₜ | sₜ = s ] = Eπ[ ∑ₖ₌₀^∞ γᵏ rₜ₊ₖ₊₁ | sₜ = s ]
Intuitively: if the agent is in state s and plays policy π from now on, how much total reward can it expect?
The Action-Value Function Q(s, a)
The Q-function (quality function) tells us how good it is to take action a in state s:
Qπ(s, a) = Eπ[ Gₜ | sₜ = s, aₜ = a ]
Q is more useful for learning because the optimal policy follows trivially: π*(s) = argmaxₐ Q*(s, a)
The Bellman Equation
The key insight of RL: a state's value can be expressed recursively in terms of the next state's value. For a policy π:
Vπ(s) = ∑ₐ π(a|s) ∑ₛ′ P(s′|s, a) [ R(s, a) + γ Vπ(s′) ]
This recursive structure allows us to compute values by bootstrapping: use current estimates to update estimates. This is the core of Q-learning and TD learning.
Interactive: Value Iteration
A 4x4 grid world. Click cells to cycle reward types. Then press Iterate to run one Bellman sweep. Watch values propagate from the goal backwards.
γ = 0.9
Sweeps: 0 · Max |ΔV|: —
■ Goal (+10) ■ Trap (-5) ■ Wall (blocked) ■ Empty (0). Click cells to change type.
Chapter 4
Q-Learning: A Live Demo
Watch an agent learn to navigate a grid world from scratch, with no knowledge of the environment.
The Q-Learning Algorithm
Q-learning is a model-free, off-policy TD algorithm. It directly learns the optimal Q-function without needing to know the environment dynamics.
Q(s, a) ← Q(s, a) + α [ r + γ maxₐ′ Q(s′, a′) − Q(s, a) ]
// Q-Learning Algorithm
Initialize Q(s, a) = 0 for all s, a
for each episode:
    s ← start state
    while s is not terminal:
        a ← ε-greedy(Q, s)                       // explore or exploit
        r, s' ← env.step(s, a)                   // take action, observe
        Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',·) - Q(s,a)]
        s ← s'
    decay ε                                      // less exploration over time
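The pseudocode above translates directly into runnable Python. A minimal sketch on a 1-D corridor instead of the full grid (the corridor, reward, and hyperparameters are stand-ins chosen for illustration):

```python
import random

random.seed(0)

# States 0..4 in a corridor; start at 0, reaching state 4 (the goal) pays +10.
N_STATES, GOAL = 5, 4
alpha, gamma, eps = 0.5, 0.9, 1.0
Q = {(s, a): 0.0 for s in range(N_STATES) for a in (-1, 1)}

for episode in range(200):
    s = 0
    while s != GOAL:
        if random.random() < eps:                         # explore
            a = random.choice([-1, 1])
        else:                                             # exploit
            a = max((-1, 1), key=lambda x: Q[(s, x)])
        s2 = min(max(s + a, 0), N_STATES - 1)             # walls at both ends
        r = 10.0 if s2 == GOAL else 0.0
        best_next = 0.0 if s2 == GOAL else max(Q[(s2, -1)], Q[(s2, 1)])
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # TD update
        s = s2
    eps = max(0.05, eps * 0.99)                           # decay exploration

policy = [max((-1, 1), key=lambda x: Q[(s, x)]) for s in range(GOAL)]
print(policy)  # [1, 1, 1, 1]: always move right, toward the goal
```

Note the agent never sees the transition rules: it only observes (s, a, r, s') tuples, exactly the model-free setting described above.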
Live Grid World Demo
The agent (white circle) starts top-left, goal is bottom-right (gold). Walls block movement. Cell color = max Q-value (darker = less known/good, brighter = more promising).
The cell colors lighten as Q-values grow near the goal
Policy arrows eventually point toward the goal from everywhere
As ε decays, the agent stops exploring and follows the learned policy
The agent finds optimal (or near-optimal) paths through the maze
Chapter 5
Exploration vs Exploitation
The fundamental dilemma: should the agent try new things or stick with what works?
Exploitation means using current knowledge to pick the best-known action. Exploration means trying less-known actions to gather information. Too much exploitation: you get stuck at local optima. Too much exploration: you never use what you know.
The Multi-Armed Bandit Problem
Imagine 5 slot machines (bandits), each with a different but unknown expected payout. You get 50 pulls total. Your goal: maximize total reward.
This is the exploration-exploitation dilemma in its purest form — no states, just action selection.
Interactive Bandit Demo
Click the slot machines to pull them. The true reward is hidden — you must discover which machine is best through exploration!
Pull a slot machine to start...
Total pulls: 0 · Total reward: 0.00 · Best arm: ?
Common Exploration Strategies
🎲
ε-Greedy
With probability ε, pick a random action (explore). Otherwise, pick the best-known action (exploit). Simple and widely used. Decay ε over time as you learn more.
📊
UCB (Upper Confidence Bound)
Choose action with highest Q + c√(ln(t)/N(a)). This naturally explores less-tried actions more, giving "optimism in the face of uncertainty".
🌡
Boltzmann / Softmax
Sample actions proportional to exp(Q(a)/τ). Temperature τ controls randomness. High τ = uniform random, low τ = greedy. Smooth interpolation between the extremes.
👉
Thompson Sampling
Maintain a probability distribution over each arm's true value. Sample from each distribution and pick the highest sample. Bayesian and provably efficient.
UCB: Aₜ = argmaxₐ[ Q(a) + c √(ln t / N(a)) ]
where N(a) is the number of times action a has been tried and t is the total number of steps.
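ε-greedy is short enough to simulate end-to-end. A sketch with five arms whose true means are invented for illustration (arm 3 is secretly the best):

```python
import random

random.seed(1)

true_means = [0.2, 0.5, 0.1, 0.8, 0.4]   # hidden from the agent
Q = [0.0] * 5                             # running value estimate per arm
N = [0] * 5                               # pull counts
eps = 0.1

for t in range(1000):
    if random.random() < eps:
        a = random.randrange(5)                  # explore: random arm
    else:
        a = max(range(5), key=lambda i: Q[i])    # exploit: best estimate
    r = random.gauss(true_means[a], 0.1)         # noisy payout
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                    # incremental sample mean

print("estimates:", [round(q, 2) for q in Q])
print("most pulled arm:", max(range(5), key=lambda i: N[i]))
```

With 10% exploration the agent reliably identifies arm 3; set eps to 0 and it usually locks onto whichever arm it tried first, which is the failure mode of pure exploitation.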
Chapter 6
Policy Gradient Methods
Instead of learning a value function, why not directly optimize the policy?
The Core Idea
Q-learning learns a value function and derives a policy indirectly. Policy gradient methods parameterize the policy directly as πθ(a|s) and optimize parameters θ using gradient ascent on expected return.
J(θ) = Eπθ[ G₀ ] = Eπθ[ ∑ₜ γᵗ rₜ₊₁ ]
θ ← θ + α ∇θ J(θ) // gradient ASCENT to maximize
The Policy Gradient Theorem
Computing the gradient of J directly is hard — the environment dynamics P are unknown. The policy gradient theorem gives us a tractable form:
∇θ J(θ) = Eπθ[ ∇θ log πθ(a|s) · Qπ(s, a) ]
This is elegant: the gradient is the expected product of the log-probability gradient and the Q-value. We can estimate this from experience without knowing P.
REINFORCE Algorithm
The simplest policy gradient method — uses the full return Gₜ as an estimate of Q:
// REINFORCE (Williams, 1992)
Initialize policy πθ (e.g. neural network with params θ)
for each episode:
    Sample trajectory τ = (s₀,a₀,r₁, s₁,a₁,r₂, ..., s_T) from πθ
    for each step t:
        Gₜ ← ∑ₖ₌ₜ^T γᵏ⁻ᵗ rₖ₊₁               // return from t
        θ ← θ + α γᵗ Gₜ ∇θ log πθ(aₜ|sₜ)     // update
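REINFORCE is runnable in a few lines for the degenerate case of one-step episodes: a two-armed bandit with a softmax policy over two logits (the payouts and hyperparameters here are invented for illustration):

```python
import math
import random

random.seed(0)

theta = [0.0, 0.0]        # policy logits, one per arm
alpha = 0.1               # learning rate
means = [0.0, 1.0]        # hidden expected payouts; arm 1 is better

def policy_probs():
    z = [math.exp(t) for t in theta]
    return [z[0] / (z[0] + z[1]), z[1] / (z[0] + z[1])]

for episode in range(2000):
    probs = policy_probs()
    a = 1 if random.random() < probs[1] else 0        # sample from pi_theta
    G = random.gauss(means[a], 0.1)                   # one-step return
    # gradient of log softmax: d/dtheta_i log pi(a) = 1{i == a} - pi(i)
    for i in range(2):
        theta[i] += alpha * G * ((1 if i == a else 0) - probs[i])

print("P(arm 1) =", round(policy_probs()[1], 3))      # climbs toward 1
```

The update makes rewarded actions more probable in proportion to their return, which is the whole idea of the policy gradient theorem in miniature.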
Why Policy Gradients?
✅
Works with Continuous Actions
Q-learning requires a max over actions — impossible in continuous spaces. Policy gradients handle this naturally by parameterizing πθ(a|s) as a Gaussian distribution.
🔄
Stochastic Policies
Policy gradients naturally represent stochastic policies. This is useful when the optimal policy is non-deterministic (e.g., bluffing in poker requires randomness).
🔥
High Variance
REINFORCE has high variance — return estimates are noisy. Solutions: subtract a baseline, use actor-critic methods, or advantage functions A(s,a) = Q(s,a) - V(s).
🧠
Actor-Critic
Combines value functions (critic estimates V(s)) with policy gradients (actor updates πθ). The critic reduces variance; the actor gets direct policy improvement.
Value-Based vs Policy-Based
Q-Learning (Value-Based)
Learn Q(s,a), derive policy
Deterministic policy (greedy)
Discrete action spaces
Often more sample efficient
DQN, Rainbow, C51
Policy Gradients (Policy-Based)
Directly optimize πθ(a|s)
Stochastic policies
Continuous action spaces
Higher variance, but scalable
PPO, A3C, SAC, TRPO
Chapter 7
Deep RL & Beyond
Combining deep neural networks with RL to tackle complex, high-dimensional problems.
The Problem with Tabular RL
Q-learning stores a table of Q(s, a) values. This works for small discrete state spaces, but fails in complex environments. A single Atari observation (an 84×84×4 stack of 256-level grayscale pixels) has 256^(84·84·4) possible states — a table won't fit in the observable universe.
Deep RL replaces the Q-table with a neural network: Q(s, a; θ). The network takes a state (e.g., raw pixels) and outputs Q-values for all actions.
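Concretely, the Q-network is just a function from a state vector to one Q-value per action. A minimal untrained sketch in NumPy (the layer sizes are arbitrary, and the weights are random: real DQN fits them by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, HIDDEN, N_ACTIONS = 4, 32, 2    # sizes chosen for illustration

W1 = rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_values(state):
    h = np.maximum(0.0, state @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                     # one Q-value per action

s = rng.normal(size=STATE_DIM)             # e.g. a CartPole-style state vector
q = q_values(s)
print(q, "-> greedy action:", int(np.argmax(q)))
```

One forward pass replaces a table lookup, and nearby states share weights, so the network generalizes across states a table never visits.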
Key Algorithms Timeline
2013
DQN (DeepMind) — First deep RL system to master Atari games from raw pixels. Key innovations: experience replay, target networks to stabilize training.
2016
A3C (Asynchronous Advantage Actor-Critic) — Multiple agents run in parallel, asynchronously updating shared network weights. Faster and more stable than DQN.
2016
AlphaGo — Defeated world Go champion. Combined supervised learning, policy gradients, and Monte Carlo Tree Search.
2017
PPO (Proximal Policy Optimization) — OpenAI's workhorse algorithm. Clips the policy update to prevent destructively large steps. Simple, robust, widely used today.
2018
SAC (Soft Actor-Critic) — Maximum entropy RL. Maximizes reward AND policy entropy (randomness), leading to better exploration and robustness.
2019+
AlphaStar, OpenAI Five, MuZero — Superhuman performance in StarCraft, Dota 2. MuZero learns its own world model without being told the rules.
2022+
RLHF (RL from Human Feedback) — The technique behind ChatGPT/Claude. Train a reward model from human preferences, then use PPO to fine-tune the language model.
DQN: The Core Innovation
// Deep Q-Network (DQN) — simplified
Initialize Q-network Q(s,a;θ) and target net Q(s,a;θ⁻)
Initialize replay buffer D
for each step:
    a = ε-greedy(Q(s,·;θ))
    r, s' = env.step(a)
    D.push((s, a, r, s'))                    // store experience
    (s,a,r,s') = D.sample_batch()            // replay for stability
    y = r + γ maxₐ′ Q(s',a';θ⁻)              // target (fixed params)
    θ ← θ - α ∇θ (y - Q(s,a;θ))²             // gradient descent
    periodically: θ⁻ ← θ                     // update target net
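The replay buffer D from the pseudocode is a tiny data structure on its own. A minimal sketch (the capacity and the fake transitions are placeholders):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store transitions, sample random minibatches
    to break the correlation between consecutive steps."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)   # oldest experience falls off

    def push(self, s, a, r, s2, done):
        self.buf.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)

buf = ReplayBuffer(capacity=10_000)
for t in range(100):                        # fake transitions for the demo
    buf.push(s=t, a=0, r=0.0, s2=t + 1, done=False)
batch = buf.sample(32)
print(len(buf), len(batch))                 # 100 32
```

Sampling uniformly from old experience is one of the two DQN stabilizers; the other is the target network θ⁻ that keeps the regression target fixed between syncs.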
Where to Go Next
📚
Sutton & Barto
"Reinforcement Learning: An Introduction" — the canonical textbook, freely available online. Complete mathematical treatment of everything covered here.
🏭
Gymnasium (OpenAI Gym)
The standard Python library for RL environments. CartPole, MountainCar, Atari, MuJoCo. Start here to run your own Q-learning and PPO experiments.
💻
Stable-Baselines3
High-quality PyTorch implementations of PPO, SAC, DQN, A2C. The fastest way to train RL agents without writing algorithms from scratch.
🚀
DeepMind / OpenAI Papers
Read the original DQN, PPO, AlphaGo papers on arXiv. The field moves fast — following new papers from top labs is how to stay current.
You have completed RL Tutor!
You now understand the full stack: from MDPs and value functions, through Q-learning and policy gradients, to modern deep RL. The best way to solidify this knowledge is to implement: start with tabular Q-learning on CartPole, then try DQN on Atari.