Tutorial Course

COMP 2205 · Machine Learning: Reinforcement Learning

Led by Suttonesque Analysis Simulacrum

5 modules · Computing · Updated 6 days ago

The exploration/exploitation dilemma, Upper Confidence Bound, Thompson Sampling, and the Bellman equation. Based on the writings of Richard Bellman.

Module 1

The Exploration/Exploitation Dilemma

Led by Suttonesque Analysis Simulacrum

The question
You have 10 ad creatives. You want the best-performing one. Every impression you spend on a bad ad adds to your regret, the reward lost relative to always showing the best ad. Every impression you spend exploring is opportunity cost. How do you find the best ad quickly without wasting too many impressions on bad ones?
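
To make "regret" concrete, here is a minimal sketch that scores a sequence of ad choices against the best ad. It assumes each ad has a fixed true click-through rate, which is never known in practice; the function name `cumulative_regret` and the three example rates are hypothetical, chosen only for illustration.

```python
def cumulative_regret(true_rates, chosen_arms):
    """Expected cumulative regret of a sequence of ad choices.

    true_rates: assumed true click-through rate of each ad (hypothetical values).
    chosen_arms: index of the ad shown at each impression.
    """
    best = max(true_rates)
    # Each impression loses (best rate - rate of the ad actually shown) in expectation.
    return sum(best - true_rates[arm] for arm in chosen_arms)


# Hypothetical rates for three ads; the third ad is the best one.
rates = [0.02, 0.03, 0.05]

# Policy A: always show the best ad (possible only with hindsight).
always_best = [2] * 1000

# Policy B: rotate through all ads forever, i.e. explore and never exploit.
round_robin = [t % 3 for t in range(1000)]

print(cumulative_regret(rates, always_best))
print(cumulative_regret(rates, round_robin))
```

The hindsight policy has zero regret, while pure rotation keeps accumulating regret at a constant rate per impression. A good bandit algorithm sits between the two, with regret that grows ever more slowly as it learns.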

Outcome
The student can describe the exploration/exploitation trade-off and define regret.
Sub-units
1.1 The Dilemma in Concrete Terms

Module 2

Upper Confidence Bound

Led by Suttonesque Analysis Simulacrum

The question
UCB is optimistic under uncertainty. For each arm, compute the estimated mean reward plus an exploration bonus that shrinks as the arm is played more often, then choose the arm with the highest sum. Why does this automatically balance exploration and exploitation, and what does convergence look like?
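
The rule described above can be sketched as follows. This is one common variant, UCB1, whose bonus is sqrt(2 ln t / n); the course text does not name a specific bonus, so that choice is an assumption, as are the simulated Bernoulli (click / no click) rewards and the function name `ucb1`.

```python
import math
import random


def ucb1(true_rates, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms.

    true_rates: assumed success probability of each arm (hypothetical, for simulation only).
    Returns (pull count per arm, total reward collected).
    """
    rng = random.Random(seed)
    k = len(true_rates)
    counts = [0] * k    # how many times each arm has been played
    means = [0.0] * k   # running mean reward of each arm
    total = 0

    for t in range(1, horizon + 1):
        if t <= k:
            # Play every arm once so that each bonus term is defined.
            arm = t - 1
        else:
            # Estimated mean plus a bonus that shrinks as counts[a] grows.
            arm = max(
                range(k),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1 if rng.random() < true_rates[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
        total += reward

    return counts, total


counts, total = ucb1([0.1, 0.5, 0.9], horizon=2000)
print(counts, total)
```

A rarely played arm carries a large bonus and so gets tried again; once its bonus shrinks, it is chosen only if its estimated mean is genuinely competitive. In this simulation the pull counts should concentrate on the arm with the highest true rate.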

Outcome
The student can implement UCB and compare cumulative reward to random selection.
Sub-units
2.1 Implement UCB

Module 3

Thompson Sampling

Led by Suttonesque Analysis Simulacrum

The question
Maintain a Beta distribution over each arm's success probability. Sample once from each distribution and play the arm with the largest sample. Over time, the distributions for frequently played arms become narrow, while those for rarely played arms stay wide. Why does this Bayesian approach often outperform UCB empirically on bandit problems?
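
A minimal sketch of the procedure above, assuming Bernoulli rewards and a uniform Beta(1, 1) prior on every arm. Neither assumption is stated in the course text, and the function name `thompson_sampling` is hypothetical.

```python
import random


def thompson_sampling(true_rates, horizon, seed=0):
    """Run Beta-Bernoulli Thompson Sampling on simulated arms.

    true_rates: assumed success probability of each arm (hypothetical, for simulation only).
    Returns (successes per arm, failures per arm, total reward collected).
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    total = 0

    for _ in range(horizon):
        # With a Beta(1, 1) prior, the posterior is Beta(1 + successes, 1 + failures).
        samples = [
            rng.betavariate(1 + successes[a], 1 + failures[a]) for a in range(k)
        ]
        # Play the arm whose sampled success probability is largest.
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_rates[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total += reward

    return successes, failures, total


successes, failures, total = thompson_sampling([0.1, 0.5, 0.9], horizon=2000)
pulls = [s + f for s, f in zip(successes, failures)]
print(pulls, total)
```

Exploration here comes from the randomness of the samples rather than from an explicit bonus: an arm with a wide posterior occasionally produces a high sample and gets played, which in turn narrows its posterior.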

Outcome
The student can implement Thompson Sampling and compare its convergence to UCB.
Sub-units
3.1 Implement Thompson Sampling

Module 4

Beyond Bandits: The Bellman Equation

Led by Suttonesque Analysis Simulacrum

The question
The bandit problem has no state. Full RL has states, actions, rewards, and transitions. The Bellman equation ties them together. Explain it to a manager with no mathematics — and trace the path from Q-learning to deep Q-networks.
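
For reference, one standard way to write the equation is the optimality form for action values, shown below next to the Q-learning update that is derived from it. The notation is an assumption, not taken from the course text: s and a are the current state and action, r the reward, s' the next state, γ (gamma) the discount factor, and α (alpha) the learning rate.

```latex
% Bellman optimality equation for the action-value function (notation assumed)
Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \,\right]

% Q-learning: after observing a transition (s, a, r, s'),
% move the current estimate Q(s, a) toward the right-hand side above
Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]
```

In plain terms, the value of a choice now is the immediate reward plus the discounted value of acting as well as possible from wherever you land next. Tabular Q-learning stores one estimate per state-action pair; deep Q-networks replace that table with a neural network that approximates Q.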

Outcome
The student can state the Bellman equation and describe the MDP framework.
Sub-units
4.1 The Bellman Equation in Plain Terms

Module 5

RL in Production

Led by Suttonesque Analysis Simulacrum

The question
You want to use RL to personalise product recommendations for one million users. What is the minimum responsible deployment? Bandit? Contextual bandit? Full RL? And what constitutes success?

Outcome
The student can identify production RL risks and justify a deployment strategy.
Sub-units
5.1 Final Essay: When Is RL the Right Tool?
